juriscraper

Fill `nh` 2024 gap

Open grossir opened this issue 1 year ago • 2 comments

Due to a bug in the scraper, the most recent opinion we have is from 2023. We are missing the whole year of 2024.

We will need to implement a backscraper to solve this. Related to #929

On further inspection, we also seem to be missing some opinions from earlier years. For example, we have 57 opinions for 2021, but the source shows more than 63 results.

grossir avatar Aug 21 '24 20:08 grossir

Fill gaps for nh_p

    manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.nh_p --backscrape-start=2021 --backscrape-end=2024 --verbosity 3

Backscrape recently created nh_u

    manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.nh_u --backscrape-start=2015 --backscrape-end=2024 --verbosity 3

grossir avatar Aug 26 '24 18:08 grossir

The backscraping was failing due to some edge cases:

nh_p: field_document_file was null

    {
      "title": "2020-0268, Hampstead School Board & a. v. School Administrative Unit No. 55",
      "id": "25731",
      "fields": {
        "nid": [
          "25731"
        ],
        "langcode": [
          "en"
        ],
        "revision_log": null,
        "title": [
          "2020-0268, Hampstead School Board & a. v. School Administrative Unit No. 55"
        ],
        "sticky": [
          "0"
        ],
        "moderation_state": [
          "published"
        ],
        "field_date_filed": null,
        "field_date_posted": [
          "2021-04-20"
        ],
        "field_date_revised": null,
        "field_description": null,
        "field_document": [
          {
            "id": "21666",
            "title": "Order withdrawing the opinion in case 2020-0268, Hampstead School Board & a. v. School Administrative Unit No. 55",
            "type": "document",
            "fields": {
              "nid": [
                "21666"
              ],
              "uuid": [
                "ea099976-8d32-4d8e-9102-1d08e2a9bc2c"
              ],
              "vid": [
                "57141"
              ],
              "langcode": [
                "en"
              ],
              "type": null,
              "revision_timestamp": [
                "1629236157"
              ],
              "revision_uid": null,
              "revision_log": null,
              "status": [
                "1"
              ],
              "uid": null,
              "title": [
                "Order withdrawing the opinion in case 2020-0268, Hampstead School Board & a. v. School Administrative Unit No. 55"
              ],
              "created": [
                "1628522043"
              ],
              "changed": [
                "1629236157"
              ],
              "promote": [
                "0"
              ],
              "sticky": [
                "0"
              ],
              "default_langcode": [
                "1"
              ],
              "revision_default": [
                "1"
              ],
              "revision_translation_affected": [
                "1"
              ],
              "moderation_state": [
                "published"
              ],
              "metatag": null,
              "path": [
                "{\"alias\":\"\\/documents\\/order-withdrawing-opinion-case-2020-0268-hampstead-school-board-v-school-administrative\",\"pid\":\"14471\",\"langcode\":\"en\"}"
              ],
              "publish_on": null,
              "unpublish_on": null,
              "publish_state": null,
              "unpublish_state": null,
              "menu_link": null,
              "field_date_filed": null,
              "field_date_posted": [
                "2021-06-02"
              ],
              "field_date_revised": null,
              "field_description": null,
              "field_document": null,
              "field_document_category": null,
              "field_document_file": {
                "0": {
                  "id": "14511",
                  "title": "",
                  "type": "document",
                  "fields": {
                    "fid": [
                      "14511"
                    ],
                    "uuid": [
                      "dd9b7dd2-4feb-4997-9996-58226ba1315c"
                    ],
                    "langcode": [
                      "en"
                    ],
                    "uid": null,
                    "filename": [
                      "6-1-2021-order.pdf"
                    ],
                    "uri": [
                      "public://documents/2021-08/6-1-2021-order.pdf"
                    ],
                    "filemime": [
                      "application/pdf"
                    ],
                    "filesize": [
                      "57180"
                    ],
                    "status": [
                      "1"
                    ],
                    "created": [
                      "1628522043"
                    ],
                    "changed": [
                      "1628522071"
                    ]
                  }
                },
                "alt": ""
              },
              "field_document_number": null,
              "field_document_purpose": null,
              "field_document_subcategory": null,
              "field_entity_tags": null,
              "field_judge": null,
              "field_link_url": null,
              "field_parties": null,
              "field_permissions": null,
              "field_tag": null
            }
          }
        ],
        "field_document_category": null,
        "field_document_file": null,
        "field_document_number": null
      }
    }
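
In the record above, the top-level `field_document_file` is null, but a file is still attached under the first `field_document` entry. A minimal Python sketch of a fallback lookup, assuming the structure shown in this record; `get_file_uri` is a hypothetical helper, not the scraper's actual code:

```python
from typing import Optional


def get_file_uri(fields: dict) -> Optional[str]:
    """Return the file URI for a record, or None if no file is attached.

    Hypothetical helper: tries the top-level `field_document_file` first,
    then the one nested under the first `field_document` entry.
    """
    file_field = fields.get("field_document_file")
    if not file_field:
        nested_docs = fields.get("field_document") or []
        if nested_docs:
            nested_fields = nested_docs[0].get("fields") or {}
            file_field = nested_fields.get("field_document_file")
    if not file_field:
        return None
    # In the sample record the file entry is keyed by the string "0"
    entry = file_field.get("0") or {}
    uris = (entry.get("fields") or {}).get("uri") or []
    return uris[0] if uris else None
```

Note that the nested document here is an order withdrawing the opinion, so whether such a record should be ingested or skipped is a separate decision; the sketch only shows how to avoid failing on the null field.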

nh_u: field_date_posted was null

    {
      "title": "2015-0082, Marissa Rattee v. Andre Bertolino",
      "id": "38281",
      "fields": {
        "nid": [
          "38281"
        ],
        "langcode": [
          "en"
        ],
        "revision_log": null,
        "title": [
          "2015-0082, Marissa Rattee v. Andre Bertolino"
        ],
        "sticky": [
          "0"
        ],
        "moderation_state": [
          "published"
        ],
        "field_date_filed": [
          "2015-05-21"
        ],
        "field_date_posted": null,
        "field_date_revised": null,
        "field_description": null,
        "field_document": null,
        "field_document_category": null
      }
    }
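
Here `field_date_posted` is null while `field_date_filed` is populated. One way to handle this, sketched in Python under the assumption that the filed date is an acceptable substitute; `get_opinion_date` and the preference order are hypothetical, not taken from the scraper:

```python
from datetime import date, datetime
from typing import Optional


def get_opinion_date(fields: dict) -> Optional[date]:
    """Return the posted date if present, otherwise the filed date.

    Each field is either None or a list holding one "YYYY-MM-DD" string,
    as in the records shown in this thread.
    """
    for key in ("field_date_posted", "field_date_filed"):
        values = fields.get(key)
        if values:
            return datetime.strptime(values[0], "%Y-%m-%d").date()
    return None
```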

grossir avatar Sep 02 '24 16:09 grossir

After running the nh_u backscraper, it still needs some fixes:

2024: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 156/188 opinions.
2023: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 137/171 opinions
2022: Error: docket_str = fields["field_description"][0]["#text"] raises TypeError: 'NoneType' object is not subscriptable
2021: 0 results, this is a bug
2020: 0 results, this is a bug
2019: 0 results, this is a bug
2018: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 134/134 opinions.
2017: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 165/165 opinions.
2016: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 191/191 opinions.
2015: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 223/224 opinions.
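
The 2022 error comes from subscripting `field_description` when it is null. A guarded version might look like the Python sketch below. `get_docket_str` is a hypothetical helper, and the fallback to the leading docket number in the title is an assumption based on the titles shown in this thread (e.g. "2015-0082, Marissa Rattee v. Andre Bertolino"):

```python
import re
from typing import Optional

# Assumed docket number shape, based on the titles shown in this thread
DOCKET_RE = re.compile(r"\d{4}-\d{4}")


def get_docket_str(fields: dict) -> Optional[str]:
    """Return the docket string without raising when field_description is null."""
    description = fields.get("field_description")
    if description:
        return description[0].get("#text")
    # Assumed fallback: the title starts with the docket number
    titles = fields.get("title") or []
    match = DOCKET_RE.match(titles[0]) if titles else None
    return match.group(0) if match else None
```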

grossir avatar Nov 12 '24 03:11 grossir

Running the nh_p backscraper:

2021: DEBUG juriscraper.opinions.united_states.state.nh_p: Successfully crawled 26/63 opinions.
2022: DEBUG juriscraper.opinions.united_states.state.nh_p: Successfully crawled 23/74 opinions.
2023: DEBUG juriscraper.opinions.united_states.state.nh_p: Successfully crawled 46/61 opinions.
2024: DEBUG juriscraper.opinions.united_states.state.nh_p: Successfully crawled 26/63 opinions.

For 2024, the source has introduced new versions of some documents. For example: new, old

The new one has slight modifications, such as "DONOVAN and COUNTWAY, JJ., concurred; HANTZ MARCONI, J., sat for oral argument but did not participate in the final vote." vs "DONOVAN and COUNTWAY, JJ., concurred."

This is another instance of the versioning issue https://github.com/freelawproject/courtlistener/issues/3803
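
To illustrate why a revised document can show up as a separate item: if duplicates are detected by hashing the content (an assumption in this sketch; the linked issue describes how versioning is actually handled), even a one-sentence change produces a different hash. `content_hash` is a hypothetical helper:

```python
import hashlib

old_text = "DONOVAN and COUNTWAY, JJ., concurred."
new_text = (
    "DONOVAN and COUNTWAY, JJ., concurred; HANTZ MARCONI, J., sat for oral "
    "argument but did not participate in the final vote."
)


def content_hash(text: str) -> str:
    """Return the SHA-1 hex digest of the UTF-8 encoded text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# The two versions differ by one clause, so their hashes do not match
is_same_version = content_hash(old_text) == content_hash(new_text)
```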

grossir avatar Nov 12 '24 03:11 grossir

For the failing nh_u years:

2019: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 152/156 opinions.

2020: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 147/148 opinions.

2021: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 116/116 opinions.

2022: DEBUG juriscraper.opinions.united_states.state.nh_u: Successfully crawled 125/125 opinions.

grossir avatar Jan 08 '25 17:01 grossir