polars icon indicating copy to clipboard operation
polars copied to clipboard

Fixed-width text file reader

Open DrMaphuse opened this issue 3 years ago • 4 comments

Describe your feature request

Pandas implements a function for reading fixed-width text files, which are produced, for example, by some SQL queries. It would be neat to have this in polars as well.

DrMaphuse avatar Apr 15 '22 12:04 DrMaphuse

As a workaround, I am using this code to read a fixed-width file. It's more than twice as fast as pandas' "read_fwf" for me when reading large files.

df = pl.read_csv(file_path)
df.columns = ['full_str']

for w, col in zip(fwidths, column_names):
    df = df.with_column(pl.col("full_str").str.slice(0, w).str.rstrip().alias(col))
    df = df.with_column(pl.col("full_str").str.slice(w, None).alias("full_str"))
df = df.drop('full_str')

jicart avatar Jan 19 '23 16:01 jicart

$ cat fwf_test.txt
NAME                STATE     TELEPHONE  
John Smith          WA        418-Y11-4111
Mary Hartford       CA        319-Z19-4341
Evan Nolan          IL        219-532-c301

This will probably perform faster:

df = pl.read_csv(
    "fwf_test.txt",
    has_header=False,
    skip_rows=1,
    new_columns=["full_str"]
)

column_names = [ "NAME", "STATE", "TELEPHONE" ]
widths = [20, 10, 12]


# Calculate slice values from widths.
slice_tuples = []
offset = 0

for i in widths:
    slice_tuples.append((offset, i))
    offset += i

df.with_columns(
    [
       pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip().alias(col)
       for slice_tuple, col in zip(slice_tuples, column_names)
    ]
).drop("full_str")
shape: (3, 3)
┌───────────────┬───────┬──────────────┐
│ NAME          ┆ STATE ┆ TELEPHONE    │
│ ---           ┆ ---   ┆ ---          │
│ str           ┆ str   ┆ str          │
╞═══════════════╪═══════╪══════════════╡
│ John Smith    ┆ WA    ┆ 418-Y11-4111 │
│ Mary Hartford ┆ CA    ┆ 319-Z19-4341 │
│ Evan Nolan    ┆ IL    ┆ 219-532-c301 │
└───────────────┴───────┴──────────────┘

ghuls avatar Jan 19 '23 17:01 ghuls

This will probably perform faster:

Very nice! Opening a 500MiB text file: pandas' read_fwf: ~48 secs my code: ~24 secs your code: ~5.5 secs

jicart avatar Jan 19 '23 17:01 jicart

Here is my preferred implementation for the time being:


def read_fixed_width_file_as_strs(file_path: Union[Path, str], col_names_and_widths: Dict[str, int], *, skip_rows: int = 0) -> pl.DataFrame:
	"""
	Reads a fixed-width file into a dataframe.
	Reads all values as strings (as indicated by function name).
	Strips all values of leading/trailing whitespaces.

	Args:
		col_names_and_widths: A dictionary where the keys are the column names and the values are the widths of the columns.
	"""

	# Source: adapted from https://github.com/pola-rs/polars/issues/3151#issuecomment-1397354684

	df = pl.read_csv(
		file_path,
		has_header=False,
		skip_rows=skip_rows,
		new_columns=["full_str"],
	)

	# transform col_names_and_widths into a Dict[cols name, Tuple[start, width]]
	slices: Dict[str, Tuple[int, int]] = {}
	start = 0
	for col_name, width in col_names_and_widths.items():
		slices[col_name] = (start, width)
		start += width

	df = df.with_columns(
		[
			pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip_chars().alias(col)
			for col, slice_tuple in slices.items()
		]
	).drop(["full_str"])

	return df

DeflateAwning avatar Mar 17 '24 23:03 DeflateAwning

FWIW, I am using str.extract_groups to read/parse a fwf file https://stackoverflow.com/a/78545671/2383070

blaylockbk avatar May 28 '24 17:05 blaylockbk