
Issue with date_from and date_until

Open vgoel38 opened this issue 5 years ago • 10 comments

I copied the following url from the output of the program. The url looks for records between dates 2019-01-01 and 2019-05-10.

URL: http://export.arxiv.org/oai2?verb=ListRecords&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs

But a lot of the records I got lie outside this date range (e.g., the first record is from 2007).

Am I missing something? I am not sure whether the issue is with the code or with the arXiv API.
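For reference, the URL above can be reconstructed with just the standard library; the parameter names come straight from the quoted URL, and this is only an illustrative sketch, not arxivscraper's own code. It makes clear that from and until are OAI-PMH datestamp parameters:

```python
# Sketch: build the OAI-PMH ListRecords URL quoted above using only
# the standard library. Parameter names are taken from the quoted URL.
from urllib.parse import urlencode

BASE = "http://export.arxiv.org/oai2"

def list_records_url(date_from, date_until, set_name="cs", prefix="arXiv"):
    """Return the OAI-PMH ListRecords URL for a datestamp range."""
    params = {
        "verb": "ListRecords",
        "from": date_from,
        "until": date_until,
        "metadataPrefix": prefix,
        "set": set_name,
    }
    return BASE + "?" + urlencode(params)

print(list_records_url("2019-01-01", "2019-05-10"))
# http://export.arxiv.org/oai2?verb=ListRecords&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs
```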

vgoel38 avatar May 12 '20 06:05 vgoel38

I'm having the same issue: only 10% of the returned papers were within the requested date range.

thecheeseontoast avatar Jun 17 '20 10:06 thecheeseontoast

@vgoel38 and @thecheeseontoast : Thank you for raising the issue. The scraper returns two date columns for each record:

  • created
  • updated

If the updated date is within the specified range, the scraper still returns that record even when the created date is out of range. arXiv specifically mentions this here:

Every OAI-PMH metadata record has a datestamp associated with it, which is the last modification time of that record. Because arXiv has updated metadata records in bulk on several occasions, the OAI-PMH datestamp values do not correspond with the original submission or replacement times for older articles, and may not for newer articles because of administrative and bibliographic updates. The earliest datestamp is given in the <earliestDatestamp> element of the Identify response.

If it would be useful, I can slightly modify the behavior to use earliestDatestamp in addition to the last datestamp.
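In the meantime, one workaround is to post-filter the returned records on their created dates client-side, since from/until match the datestamp (last modification), not the submission date. This is just a sketch; the record layout (dicts with a 'created' key) is an assumption for illustration, not the scraper's actual output format:

```python
# Hedged sketch: client-side post-filtering on the `created` field.
# The dict-based record shape here is illustrative only.
from datetime import date

def filter_by_created(records, date_from, date_until):
    """Keep only records whose `created` date falls inside the range."""
    lo, hi = date.fromisoformat(date_from), date.fromisoformat(date_until)
    return [r for r in records
            if lo <= date.fromisoformat(r["created"]) <= hi]

records = [
    {"id": "0704.0001", "created": "2007-04-02"},  # bulk-updated old record
    {"id": "1901.0123", "created": "2019-02-15"},  # genuinely in range
]
kept = filter_by_created(records, "2019-01-01", "2019-05-10")
print([r["id"] for r in kept])  # ['1901.0123']
```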

Mahdisadjadi avatar Jun 22 '20 04:06 Mahdisadjadi

I've noticed that even some dates in the "updated" column are out of the range.

ChakreshIITGN avatar Jun 22 '20 21:06 ChakreshIITGN

@ChakreshIITGN That's right. The edit doesn't have to be made by the authors: when arXiv runs a bulk job, it modifies the datestamps.

The OAI-PMH interface does not support selective harvesting based on submission date. The datestamps are designed to support incremental harvesting of updates on an ongoing basis. It is not possible to selectively harvest only, say, articles submitted in February 2001 (identifiers 0102.xxxx). Except for selective harvesting based on subject areas (see description of Sets below) the interface is designed to support copying and synchronization of a complete set of arXiv metadata. In order to harvest metadata for all articles, either make requests without a datestamp range (recommended), or make requests from the <earliestDatestamp> through to the present (but beware that because of bulk updates there are some dates on which there were large numbers of updates). [source]

I am not sure what the best way to proceed is, but I'm considering various options.

Mahdisadjadi avatar Jun 24 '20 00:06 Mahdisadjadi

Hey, great tool guys! I found a bug in the Record._get_authors method: sometimes the author tag doesn't have forenames.

Bug reproduction:

import arxivscraper
scraper = arxivscraper.Scraper(category='cs', date_from='2020-06-25',date_until='2020-06-27')

output = scraper.scrape()
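A defensive fix would be to treat forenames as optional when extracting authors. The sketch below mimics the arXiv OAI metadata element names (author, keyname, forenames), but it is illustrative, not the library's actual code:

```python
# Sketch: defensive author extraction where <forenames> may be absent
# (e.g., collaboration authors). Element names follow arXiv OAI metadata.
import xml.etree.ElementTree as ET

XML = """
<authors>
  <author><keyname>Goel</keyname><forenames>V.</forenames></author>
  <author><keyname>Collaboration</keyname></author>
</authors>
"""

def get_authors(root):
    names = []
    for author in root.iter("author"):
        keyname = author.findtext("keyname", default="")
        forenames = author.findtext("forenames")  # None when the tag is missing
        names.append(f"{forenames} {keyname}".strip() if forenames else keyname)
    return names

print(get_authors(ET.fromstring(XML)))  # ['V. Goel', 'Collaboration']
```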

valayDave avatar Jun 27 '20 02:06 valayDave

@valayDave : Did you use pip to install or the repo?

Mahdisadjadi avatar Jul 06 '20 01:07 Mahdisadjadi

I installed it with pip, not from source.

valayDave avatar Jul 07 '20 00:07 valayDave

@valayDave Sorry, the pip version is lagging behind, but this issue should be fixed in the source.

Mahdisadjadi avatar Jul 07 '20 00:07 Mahdisadjadi

@valayDave The pip version is now updated to the latest, so this bug should be fixed.

Mahdisadjadi avatar Sep 19 '20 18:09 Mahdisadjadi

@Mahdisadjadi One way to get around this that I thought of: the get_metadata() method has a time key in its dictionary output for every record, and this time is the original submission time. We could therefore use the value of this key as a conditional check against from and until.

csrajath avatar Sep 20 '20 10:09 csrajath