
invalid characters in CODSUS text cause parse_wpc_surface_bulletin to fail

Open ahaberlie opened this issue 5 months ago • 8 comments

What went wrong?

MetPy version: 1.7.1 Python: 3.13.5 (conda environment)

Summary: using the Iowa State archive of WPC surface front text files, we encountered an error related to invalid characters in the lat/lon fields. While this would be trivial to fix for one day, these invalid characters are peppered throughout the archive. If you use a year or more of data, it becomes difficult to track down the issues.

Steps to download example data that produces the error:

  1. Go to: https://mesonet.agron.iastate.edu/wx/afos/p.php?pil=CODSUS
  2. Download Text
  3. Read in the file using parse_wpc_surface_bulletin

When you inspect the file, you can see that a stray " character has been inserted into the coded coordinate:

COLD WK 44135 42138 411"45 37152 35155 33161

In some cases the invalid character appears to have been "fixed": an "updated" text product is issued for the same forecast and valid time with the invalid character removed:

https://mesonet.agron.iastate.edu/wx/afos/p.php?pil=CODSUS&e=200005260728

A malformed front label (TROF, COLD, etc.) also appears to cause the parser to silently ignore that line. It might be good to let the user know this is happening. For example, this text file has a label TRmOF that does not wind up in the DataFrame:

https://mesonet.agron.iastate.edu/wx/afos/p.php?pil=CODSUS&e=200003290728
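One way to surface these silently dropped lines would be a small pre-check that flags lines whose leading token looks like a label but is not recognized. A rough sketch (the label set and the token layout here are my assumptions, not MetPy's actual internals):

```python
import re

# Assumed set of valid coded feature labels; the real parser may accept others.
VALID_LABELS = {'HIGHS', 'LOWS', 'COLD', 'WARM', 'STNRY', 'OCFNT', 'TROF'}

def find_unrecognized_labels(text):
    """Yield (line_number, line) for lines that contain coded-coordinate
    tokens but whose leading token is not a recognized label."""
    for num, line in enumerate(text.splitlines(), start=1):
        tokens = line.split()
        if not tokens:
            continue
        label = tokens[0]
        # Only consider lines that look like coded geometry (standalone runs
        # of 4-7 digits), so header lines like 'VALID 252100Z' are skipped.
        if (re.search(r'\b\d{4,7}\b', line)
                and re.fullmatch(r'[A-Za-z]+', label)
                and label.upper() not in VALID_LABELS):
            yield num, line
```

Running this over a bulletin would flag a TRmOF line while leaving COLD/TROF lines and VALID timestamps alone.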

Possible fixes (maybe a utility function clean_wpc_surface_bulletin):

  1. replace lowercase letters with ""
  2. replace punctuation with ""

Example function:

import string

def clean_wpc_surface_bulletin(input_path, output_path=None):
    """Remove common invalid characters from WPC surface bulletin.
    Specifically, this function will remove any lowercase letters
    and punctuation. This function could help fix exceptions when
    running parse_wpc_surface_bulletin and keep some cases where
    an invalid character is handled by removing the entire line
    from the resulting DataFrame.
    
    Parameters
    ----------
    bulletin : file-like object
        file-like object that will be read from directly.
    output_path : str
        location at which to write out the cleaned version of the file. 
        If None, the resulting text will be returned.

    Returns
    -------
    cleaned_text: str
        If output_path is None, this will return the cleaned text. 
        Otherwise, return None.
    """
    remove_chars = string.ascii_lowercase + string.punctuation
    
    with open(input_path, "r", encoding="utf-8") as f:
        text = f.read()

    cleaned_text = "".join(ch for ch in text if ch not in remove_chars)
    
    if output_path:
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(cleaned_text)
    else:
        return cleaned_text

This works in many cases, but not in ones like https://mesonet.agron.iastate.edu/wx/afos/p.php?pil=CODSUS&e=200012271924

where you get a line like: WARM WK 4627 4324 I4222

Problems with cleaning function and cleaning in general:

  1. This has to undergo significant testing to make sure that "good" cases are not removed.
  2. There are some otherwise valid characters that are simply misplaced. Handling these could require parse-time cleaning, where tokens are checked after the line is split, e.g., once a line is split, apply a rule that checks certain indexes for invalid characters: ['WARM', 'WK', '4627', '4324', 'I4222']
  3. There may be some cases where you should ignore an initial forecast and only use the updated forecast.
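The parse-time check in item 2 could be sketched along these lines. The assumption that everything after the label and an optional strength code (e.g. WK) is a coded coordinate is mine; the real token layout varies by product:

```python
import re

def validate_front_line(line):
    """Split a coded front line and separate coordinate tokens that
    contain non-digit characters (e.g. 'I4222' or '411"45') from the
    rest. Returns (cleaned_tokens, bad_tokens)."""
    tokens = line.split()
    # Assume an optional alphabetic strength code (e.g. 'WK') follows
    # the label, and everything after that is a coded coordinate.
    start = 2 if len(tokens) > 1 and tokens[1].isalpha() else 1
    bad = [t for t in tokens[start:] if not t.isdigit()]
    # One possible repair: strip non-digit characters from coordinates.
    cleaned = tokens[:start] + [re.sub(r'\D', '', t) for t in tokens[start:]]
    return cleaned, bad
```

Note that stripping characters from a token like I4222 assumes the remaining digits are the intended coordinate, which may not hold; dropping the token (or the whole line) may be safer.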

Operating System

Linux

Version

1.7.1

Python Version

3.13.5

Code to Reproduce

from metpy.io import parse_wpc_surface_bulletin

df = parse_wpc_surface_bulletin("200010190721-KWBC-ASUS1 -CODSUS.txt", year=2000)

Errors, Traceback, and Logs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 3
      1 from metpy.io import parse_wpc_surface_bulletin
----> 3 df = parse_wpc_surface_bulletin("200010190721-KWBC-ASUS1 -CODSUS.txt", year=2000)
      5 df

File ~/.conda/envs/metpy_test/lib/python3.13/site-packages/metpy/io/text.py:139, in parse_wpc_surface_bulletin(bulletin, year)
    136     strength, boundary = np.nan, info
    138 # Create a list of Points and create Line from points, if possible
--> 139 boundary = [Point(_decode_coords(point)) for point in boundary]
    140 boundary = LineString(boundary) if len(boundary) > 1 else boundary[0]
    142 # Add new row in the data for each front

File ~/.conda/envs/metpy_test/lib/python3.13/site-packages/metpy/io/text.py:60, in _decode_coords(coordinates)
     58 # Insert decimal point at the correct place and convert to float
     59 lat = float(f'{lat[:2]}.{lat[2:]}') * flip
---> 60 lon = -float(f'{lon[:3]}.{lon[3:]}')
     61 return lon, lat

ValueError: could not convert string to float: '"45.'

ahaberlie avatar Sep 25 '25 12:09 ahaberlie

@ahaberlie curator of the IEM archive here... I would be more than happy to clean out these stray characters from the text archive. Should I remove any " character and replace it with nothing? Anything else that I can systematically do?

akrherz avatar Sep 25 '25 12:09 akrherz

Hi @akrherz , did not mean to call you out or give you more work! As always, thank you for providing these and other datasets!

I do not want to unilaterally request changes that could cause downstream issues without a lot of testing. For example, the parser in MetPy seems to ignore TROmF cases but keeps TROF. I don't know if there are cases like this in the other direction (where removing seemingly invalid characters causes "good" examples to be ignored).

I think that lowercase letters and punctuation should be removed (I provided a first-draft function to do this above, though it has not been tested thoroughly). There are weird cases, though, where an upper-case letter is inserted into the coordinates (see the example above). This makes the logic to identify these cases a little more difficult, but I might have a blind spot there for solutions.

If this is an ongoing issue where the ingested data have these invalid characters, it might be good to have this function anyway, in case people get the data from a different source.

ahaberlie avatar Sep 25 '25 13:09 ahaberlie

No worries @ahaberlie . I did some bulk checks and found " only in products prior to 2001; the presence of m stopped in 2003. These stray characters likely come from a "binary" dataset I was given by NCEI that I had to reverse engineer, as there was no documentation for it. Individual products were split into smaller message chunks, and I often found stray single-character garbage at the beginning or end of a chunk. I wasn't clever enough to figure out when these stray characters were real or not :( I'll sit tight for a moment and run some more checks over this later. I don't like shunting problems like these downstream of me when they can be fixed once and for all at the source :/

akrherz avatar Sep 25 '25 13:09 akrherz

@akrherz I put together a table that shows the filename and invalid character for 2000. If you want me to do all of the years, I can do that.

EDIT: just wanted to clarify that this was done using the "zip" option, which seemed more useful for identifying specific files with invalid characters.

cases_where_parse_fails.csv

ahaberlie avatar Sep 25 '25 15:09 ahaberlie

I'm happy to update the MetPy parser if there are general problems that occur in the original products we should account for. My concern going beyond that would be that broadening the parser causes us to incorrectly parse things or turn junk into something that looks like data.

Right now, it seems the problems are specific to the ISU archive, and given that @akrherz is willing to clean things up, the best path forward seems to be to address them upstream.

Unless I'm missing something?

dopplershift avatar Sep 25 '25 16:09 dopplershift

You are not missing anything, @dopplershift.

Thank you again @akrherz for looking into this. I will send a follow up email.

ahaberlie avatar Sep 25 '25 16:09 ahaberlie

I think there is an enhancement we can make here in terms of being more fault-tolerant, if not aggressively cleaning. Since we're iterating line-by-line, we should catch ValueErrors and issue a warning, but continue parsing the rest of the file. Could be a bit messy if someone hands the function complete garbage, but would make the issue here a much nicer experience.
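Until something like that lands in the parser itself, a user-side stopgap in the same spirit might drop unparseable coded front lines with a warning before handing the text off. A sketch (the label set is an assumption, as is the token layout):

```python
import re
import warnings

# Assumed set of coded feature labels.
LABELS = {'HIGHS', 'LOWS', 'COLD', 'WARM', 'STNRY', 'OCFNT', 'TROF'}

def drop_unparseable_lines(text):
    """Remove coded front lines whose coordinate tokens contain
    non-digit characters, warning for each dropped line so the user
    knows data was skipped."""
    kept = []
    for num, line in enumerate(text.splitlines(), start=1):
        tokens = line.split()
        if tokens and tokens[0] in LABELS:
            # Skip an optional alphabetic strength code such as 'WK'.
            start = 2 if len(tokens) > 1 and tokens[1].isalpha() else 1
            if any(not t.isdigit() for t in tokens[start:]):
                warnings.warn(f'Skipping unparseable line {num}: {line!r}')
                continue
        kept.append(line)
    return '\n'.join(kept)
```

The cleaned string could then be wrapped in io.StringIO and passed to parse_wpc_surface_bulletin, assuming the function accepts a file-like object as its docstring indicates.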

dopplershift avatar Sep 25 '25 16:09 dopplershift

So I updated:

  • 174 products in the archive to remove the " character.
  • 2 products to replace TROmF with TROF

Looking at your CSV failure file, those are generally one-offs without a simple regsub fix. I'm writing something now to clean those and will update once done.

akrherz avatar Sep 25 '25 20:09 akrherz