Weirdly specific failure to parse at exact position based on what's before it, from pipe() input but not from path
TL;DR: Parsing seems to fail at the newline on line 6833 of this exact file when it is read through pipe(), but works whenever I insert, remove, or replace any character before that newline, EXCEPT if the replacement is an ASCII character!
This is very weirdly specific, so I thought that maybe it'd be useful as an edge edge edge case....
Hope I'm in the right place.
Background
This is a SAM format file output by minimap2 (as exactly specified on line 1101): weird_subset_woheader.txt I have removed the header (usually starts with "@"). There are variable numbers of tab-delimited columns! Different lines have different numbers of columns!
I got some weird results with this file in my analysis pipeline, so I've spent the past few hours trying to figure out where.
For later, file -bi weird_subset_woheader.txt returns text/plain; charset=us-ascii.
A one-liner to experiment with
I've narrowed it down to this minimal example if you want it on one line:
Rscript --vanilla -e 'readr::read_tsv(pipe("cat weird_subset_woheader.txt"),col_names=F)[6830:6840,]'
What I expect vs what I get
That should give me a nice table, and every first field should start with "sample". However, it does a thing where it seems to (speculating here) escape the newline from field 12, fill field 13 with NA, then eat the first character of the subsequent line and then parse it along until it hits the next (defective?) newline.
For an example of a "good" output, see the read.table call in the "reprex" block below.
What doesn't "fix" it
- If I change the file in any way past the newline on line 6833, or if I change the file in any way in the "@"-denoted header, including if I insert a Unicode character past this point.
What does "fix" it.....
- If I use the same code to pipe the "cat" output into read.table, with arguments fill=T and header=F, then it runs fine (no first column values are exactly "*", all good).
- If I provide the file as a path, i.e. just put the path name in the first position, not in pipe. Requires pipe!
- If I remove line 6833
- If I remove, insert, or replace any character before this newline with a Unicode character using copy-and-paste into vim insert/replace mode.
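Since reading from a path works, one possible workaround that keeps the shell command is to spool the pipe's output to a temporary file and hand that path to the parser. This is only a sketch: spool_to_tempfile is a hypothetical helper name, and it rewrites line endings as "\n", so the copy is not guaranteed to be byte-for-byte identical to the command's output.

```r
# Hypothetical helper: drain a connection into a temporary file and
# return the file's path, so the parser reads from disk instead of a pipe.
spool_to_tempfile <- function(con) {
  on.exit(close(con))             # release the connection when done
  lines <- readLines(con)         # reads every line from the connection
  path <- tempfile(fileext = ".tsv")
  writeLines(lines, path)         # one line per element, "\n" separated
  path
}

# Usage (not run here):
# path <- spool_to_tempfile(pipe("cat weird_subset_woheader.txt"))
# readr::read_tsv(path, col_names = FALSE)
```

This costs an extra copy of the data on disk, which is fine for a file of a few megabytes like this one.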
What I want
I'm good for my pipeline, easy fix to work around this. But I'd like to understand what is going on here, or how I can better dig into this specific point in the file to learn more, see if there's better ways of debugging this and making sure I don't run into it again.
Is this weird or did I miss something?
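One possible way to dig into that exact spot is to look at the raw bytes around the end of a given line, using only base R. The sketch below uses hypothetical helper names (newline_offset, bytes_around) and assumes the file fits in memory and uses "\n" line endings.

```r
# Byte offset (1-based) of the newline that ends line `line_no`.
newline_offset <- function(path, line_no) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  nl <- which(bytes == as.raw(0x0a))   # positions of every "\n" byte
  nl[line_no]
}

# Raw bytes from `offset - context` to `offset + context`, clipped to the file.
bytes_around <- function(path, offset, context = 8L) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  from <- max(1L, offset - context)
  to <- min(length(bytes), offset + context)
  bytes[from:to]
}

# Usage (not run here):
# off <- newline_offset("weird_subset_woheader.txt", 6833)
# bytes_around("weird_subset_woheader.txt", off)
# rawToChar(bytes_around("weird_subset_woheader.txt", off))
```

Comparing the offset before and after an edit shows whether a given change actually moved the newline to a different byte position.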
reprex and version info
readr::read_tsv(pipe("cat weird_subset_woheader.txt"),col_names=F)[6830:6840,]
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 6898 Columns: 22
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (22): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 11 × 22
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 samp… 0 pool… 29 60 244M * 0 0 ACCA… * NM:i… ms:i…
#> 2 samp… 0 pool… 29 60 244M * 0 0 AGAT… * NM:i… ms:i…
#> 3 samp… 0 pool… 29 60 244M * 0 0 AAAT… * NM:i… ms:i…
#> 4 samp… 4 * 0 0 * * 0 0 ACAA… * rl:i… <NA>
#> 5 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 6 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 7 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 8 * rl:i… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 9 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 10 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 11 * rl:i… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> # X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
readr::problems()
read.table(pipe("cat weird_subset_woheader.txt"),header=F,fill=T)[6830:6840,]
#> V1 V2 V3 V4 V5
#> 6830 sample3_B_K17_ACAATAAGTGGTAAGAAGGTTCATCC_6to6 0 poolOne_45 29 60
#> 6831 sample25_A_L8_CGATTAACATTTAAGAACTTTTAAAT_19to18 0 poolFour_330 29 60
#> 6832 sample6_B_A14_CAAGTAACCTAAAATGATATTATGTC_14to16 0 poolOne_70 29 60
#> 6833 sample19_A_E23_AGAGGAAATGGTAATGTGCTTGCGAA_7to9 0 poolTwo_130 29 60
#> 6834 sample8_A_H19_CGCGTAATCCACAAAAATCTTTCCGC_7to8 0 poolOne_99 29 60
#> 6835 sample5_A_P13_ATAACAATCGAGAAAAAGGTTAAGGG_8to8 0 poolFour_343 29 60
#> 6836 sample15_A_A16_ACCGCAAACGGTAAATGTATTCTAAG_14to14 0 poolEight_730 29 60
#> 6837 sample22_B_A8_AACGGAACTTTTAATCTCCTTACCAA_27to35 4 * 0 0
#> 6838 sample21_B_N2_ATCCAAAAAAATAATGAATTTGTTAC_21to26 0 poolNine_806 29 60
#> 6839 sample21_B_G15_CGCCAAAACACGAATCCGGTTCCCGA_18to23 0 poolTen_911 29 60
#> 6840 sample30_B_G9_TCCCTAAGGAGAAAAGTGTTTGGCCC_17to15 0 poolNine_889 29 60
#> V6 V7 V8 V9
#> 6830 244M * 0 0
#> 6831 244M * 0 0
#> 6832 244M * 0 0
#> 6833 244M * 0 0
#> 6834 244M * 0 0
#> 6835 244M * 0 0
#> 6836 244M * 0 0
#> 6837 * * 0 0
#> 6838 244M * 0 0
#> 6839 244M * 0 0
#> 6840 244M * 0 0
#> V10
#> 6830 AGAGCTTCACATGCAGCAGCTGATCTAATTTTCCCATGTGTTCAGTGAGGTTGGGTGCCACGAAGCCTGTTTTATATAGGCAGGTTTTATCTGCATAGGAACTAAGAAAGACATTCTTGCCAAAGACTACAAGGGTCTGCCTGACCTGGCCTCTGCCATTTGGTAAGAAAACATTTAACAGCCATGCCAAAGTGCTTTTTACTTCTTAAACTGCCCAGATTTTCCTAACAAAGCTTTTGTATTC
#> 6831 AAATATCTACTAGATTCATTTGTTCTGTAGTGCAGATGAAGTTTGATGTCTCTGTGTTGATTTTCTCTAAAGAAGATCTGTTCAGTTCTGAAAGTGGGGTGTTGAAGTCTCCAACTATTATTGTATTGGGGCCTATTTCTCTCTTTAGCTCTAATAGTATTTACTTTATTTATCTTGGTGCTTCAGTGTTTGGTGAATATATATTTAAAATTGTTAAAACCTATTGCAGAATTGGCCCTTTTAT
#> 6832 AAAAGTTTCAAAAAAATTAAAATTATATCAAGTATTTTCTCTGACCACAATGGAATAAAGCTAGAAATCAATAACAAGAGGAATTTTGGAAACTATACAGCACATAGAAATTAAACAGTATGCTTCTGAATGAACAGTGTGTCAATGAAGAAATTACTAATAAGAAGGAAATTTTTTAAATTCTTGCAACAAATGAAAATAGAAACACAATATACCAAAAGTATGGAATGCAGTGAAAGCAGTA
#> 6833 ACTCTCACACTCGCACTCACTCTTAAATAAGATTAAGATCCTGTTTCAAATCCTTCTCCTCCTCCTCTTCATTTGATGTGTCGTTTTTCTCTTTATTTCCCTATCACTTGTTTTTTGAAAAACTAAATTGGCTCTGCTATTGACATAACATTATGTGTAATAGATTTCAAATTTGGAGGTATTTGGATTTTTTTCTATTTTAAAATGATTTCCTTGTTTATATTTTTGCTTACTTAGACAAAGA
#> 6834 ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATCTTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGC
#> 6835 AAATCCTGAATATCTTTGTTAATTTTCTGTCTCAATGATTTGTATAATATTGACAGTGGGGAGTTAAAGTCTCCCACTATTATTGTGTAGGAGTCGAAGTCTCTTTGTAGGTCTCTAAGAACTTGTTTTATGAATCTGAGTGCTCCTGTATTGGGTACATGTACATTTAGGATAGTTAGCTCTCCTTGTTGAATTGAACCCTTTACCATTACGTAATGCCCTTCTTTGTCTTTTTTAATCTTTG
#> 6836 AATTACTCATGATTTCTAGTCACTGATGAGACTTTCTCCTATTTTCCAGGGCTTGTTGGGCTTATCCTTGTAATATTAAAAATATTTGGATGGGAGGATCAGCGGTCATCTGTGTTCAGTCTATTATCTTGAACCAATTCCAGTATAGTTGCTGCTGTATTTAAGAAAAATTTTATGCCTGTTTTATAACTTTCTGCCCCATTAAATTATCTTTTTTTGGACTAATAGTTTTGTATTAATTGTT
#> 6837 CTCGCCTGCAGGATGCCCGGGCAT
#> 6838 ACAATATATTAGTGTTGCATATCCTCAATATTCATTTAGATTAATCCATACAATTACTCTTTCTATTGTCTTTCACTCCTTACTGAATTTCTAGGCTTTTATCTGAAATTATTTTCCTTCTGCCTGAAGAACTACCTTTTATAGTAATTTTAGTGTAATTCTGCTTAGGAAGTATTATCTGTTCTTGTTATTCTGACAACATTTTCATTTCTCCTTCATTTTTGCAGGATGTTTTTGCTGGGTG
#> 6839 ACCAATTTATATTTAAGGCAAAAATACTCAACAATTTACTCGTGTAGATAAAATTTGTTTTTACTGTTGTGTAGTATGCTATTTTATAAATAGGTGATATTTTATGTGTTTCTTTTTCTTGTAGTGTTTGGGATATCGACTAATGTTTATGAGAGACACAATTTTGACTATTCTTGAAAATGAGATTTTGAGTACATGTGACTTTTTGAGTACACATTCTCCAACTTATAAATCTAGAAGATAA
#> 6840 ACATCTTATTAAAATGTTGTTAATAATTACTTGAACAAGTACTATTTGAACGCCTATGATATTCTATGAAAAATTCTATTAATCCTTTACAGCTCAGTTAATACTACCATCTCACCTCAGCTTTATATGAGCATCTCAGCAAAGATTCCTTCCTTCTTTGTGTTTCTGGTCTACTTAATTCATGCTGAGTACTTTGCATAAACCGCATACATAGAACTGTGTATATTTATGTGTATATTCATTT
#> V11 V12 V13 V14 V15 V16 V17 V18 V19
#> 6830 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:25 s1:i:227 s2:i:0
#> 6831 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:20 s1:i:216 s2:i:0
#> 6832 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:23 s1:i:232 s2:i:0
#> 6833 * NM:i:1 ms:i:224 AS:i:224 nn:i:0 tp:A:P cm:i:20 s1:i:215 s2:i:0
#> 6834 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:23 s1:i:237 s2:i:0
#> 6835 * NM:i:1 ms:i:224 AS:i:224 nn:i:0 tp:A:P cm:i:19 s1:i:207 s2:i:0
#> 6836 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:24 s1:i:237 s2:i:0
#> 6837 * rl:i:0
#> 6838 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:22 s1:i:230 s2:i:0
#> 6839 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:20 s1:i:225 s2:i:0
#> 6840 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:22 s1:i:229 s2:i:0
#> V20 V21 V22
#> 6830 de:f:0 cs:Z::244 rl:i:0
#> 6831 de:f:0 cs:Z::244 rl:i:0
#> 6832 de:f:0 cs:Z::244 rl:i:0
#> 6833 de:f:0.0041 cs:Z::48*ga:195 rl:i:0
#> 6834 de:f:0 cs:Z::244 rl:i:0
#> 6835 de:f:0.0041 cs:Z::39*ct:204 rl:i:0
#> 6836 de:f:0 cs:Z::244 rl:i:0
#> 6837
#> 6838 de:f:0 cs:Z::244 rl:i:0
#> 6839 de:f:0 cs:Z::244 rl:i:0
#> 6840 de:f:0 cs:Z::244 rl:i:0
z <- readr::read_tsv(pipe("cat weird_subset_woheader.txt"),col_names=F)
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 6898 Columns: 22
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (22): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
z[z$X1=="*",]
#> # A tibble: 65 × 22
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 2 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 3 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 4 * rl:i… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 6 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 7 * rl:i… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 8 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 9 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 10 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> # … with 55 more rows, and 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>,
#> # X17 <chr>, X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
z <- read.table(pipe("cat weird_subset_woheader.txt"),header=F,fill=T)
# read.table names its columns V1..V22, so test V1 (z$X1 would be NULL here)
z[z$V1=="*",]
#> [1] V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#> [20] V20 V21 V22
#> <0 rows> (or 0-length row.names)
Created on 2022-09-08 with reprex v2.0.2
R version (from R --version) is 4.2.1
readr info from installed.packages():
Package: "readr"
LibPath: "/home/zed/R/x86_64-pc-linux-gnu-library/4.2"
Version: "2.1.2"
Priority: NA
Depends: "R (>= 3.1)"
Imports: "cli (>= 3.0.0), clipr, crayon, hms (>= 0.4.1), lifecycle (>=\n0.2.0), methods, R6, rlang, tibble, utils, vroom (>= 1.5.6)"
LinkingTo: "cpp11, tzdb (>= 0.1.1)"
Suggests: "covr, curl, datasets, knitr, rmarkdown, spelling, stringi,\ntestthat (>= 3.1.2), tzdb (>= 0.1.1), waldo, withr, xml2"
Enhances: NA
License: "MIT + file LICENSE"
License_is_FOSS: NA
License_restricts_use: NA
OS_type: NA
MD5sum: NA
NeedsCompilation: "yes"
Built: "4.2.0"
If you replace a character before the newline with a Unicode character and then delete another ASCII character, the problem comes back. I think this may come down to byte counts: an ASCII character is one byte, while a non-ASCII character takes two or more bytes in UTF-8, so the two edits together can put the newline back at the same byte offset. Does that seem like a useful model to test?
If I use head -n 6833 to get everything up to and including the newline, that is 2752492 bytes. It's divisible by 4... but doesn't look like an interesting number otherwise.
Is there a buffer that's off by one, something in pipe?
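One way to test the buffer idea is to check how far that byte count sits from a multiple of a candidate chunk size. The sketch below is only arithmetic: the 131072-byte (128 KiB) chunk is an assumption on my part, based on vroom's documented VROOM_CONNECTION_SIZE setting for connection buffers, and should be checked against the installed version. bytes_to_next_boundary is a hypothetical helper name.

```r
# Number of bytes between `offset` and the next multiple of `chunk`
# (0 if `offset` is already an exact multiple).
bytes_to_next_boundary <- function(offset, chunk = 131072) {
  (chunk - offset %% chunk) %% chunk
}

# Byte count reported by head -n 6833 in this thread.
bytes_to_next_boundary(2752492)
#> [1] 20
```

Under that assumed chunk size the newline would sit 20 bytes short of a boundary, which is suggestive but proves nothing on its own. Repeating the read with a different connection buffer size and seeing whether the failing row moves would be a more direct test.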
I'm able to reproduce the problem you are seeing where rows starting at 6834 are being parsed incorrectly but only when using pipe("cat file.txt"):
read_tsv("weird_subset_woheader.txt",
col_names = FALSE,
show_col_types = FALSE
)[6834:6837, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> # A tibble: 4 × 22
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
#> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 samp… 0 pool… 29 60 211M… * 0 0 AATT… * NM:i… ms:i…
#> 2 samp… 0 pool… 29 60 244M * 0 0 GTCC… * NM:i… ms:i…
#> 3 samp… 0 pool… 29 60 244M * 0 0 ACCA… * NM:i… ms:i…
#> 4 samp… 4 * 0 0 * * 0 0 ATCG… * rl:i… <NA>
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> # X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
#> # ℹ Use `colnames()` to see all variable names
# odd behavior when using pipe and cat where data is stuffed into incorrect columns
read_tsv(pipe("cat weird_subset_woheader.txt"),
col_names = FALSE,
show_col_types = FALSE
)[6834:6837, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> # A tibble: 4 × 22
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 2 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 3 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 4 * rl:i… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> # X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
#> # ℹ Use `colnames()` to see all variable names
But I can also see that there are parsing problems which makes it difficult to discern what the cause is.
problems(read_tsv(pipe("cat weird_subset_woheader.txt"),
num_threads = 1,
col_names = FALSE,
show_col_types = FALSE
))
#> # A tibble: 1,085 × 5
#> row col expected actual file
#> <int> <int> <chr> <chr> <chr>
#> 1 34 12 22 columns 12 columns ""
#> 2 36 12 22 columns 12 columns ""
#> 3 73 12 22 columns 12 columns ""
#> 4 79 12 22 columns 12 columns ""
#> 5 86 12 22 columns 12 columns ""
#> 6 88 12 22 columns 12 columns ""
#> 7 90 23 22 columns 23 columns ""
#> 8 91 23 22 columns 23 columns ""
#> 9 95 12 22 columns 12 columns ""
#> 10 98 12 22 columns 12 columns ""
#> # … with 1,075 more rows
#> # ℹ Use `print(n = ...)` to see more rows
It looks like you already knew this, since you wrote:
"There are variable numbers of tab-delimited columns! Different lines have different numbers of columns!"
But I did want to point out that in your example you're losing some information, since the 23rd column is not being parsed. As you noted, read.table() seems to handle your data better in this situation, but better still would be to supply as many column names as there are columns in your data.
# rows 90 and 91 have 23 columns per row
# read_tsv does not generate a 23rd column
read_tsv(pipe("cat weird_subset_woheader.txt"),
show_col_types = FALSE,
col_names = paste0("V", 1:23)
)[90:93, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> # A tibble: 4 × 22
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 samp… 0 pool… 26 60 247S… * 0 0 ACTA… * NM:i… ms:i…
#> 2 samp… 2048 pool… 29 60 250M… * 0 0 ACTA… * NM:i… ms:i…
#> 3 samp… 0 pool… 29 60 244M * 0 0 ACAA… * NM:i… ms:i…
#> 4 samp… 0 pool… 29 60 244M * 0 0 AGAT… * NM:i… ms:i…
#> # … with 9 more variables: V14 <chr>, V15 <chr>, V16 <chr>, V17 <chr>,
#> # V18 <chr>, V19 <chr>, V20 <chr>, V21 <chr>, V22 <chr>
#> # ℹ Use `colnames()` to see all variable names
# read.table handles this better
# if you supply column names
read.table(pipe("cat weird_subset_woheader.txt"),
header = F, fill = T,
col.names = paste0("V", 1:23)
)[90:93, ]
#> V1 V2 V3 V4 V5
#> 90 sample21_A_F3_TCTAAAATCGCAAAAAACTTTACGCA_13to15 0 poolFour_343 26 60
#> 91 sample21_A_F3_TCTAAAATCGCAAAAAACTTTACGCA_13to15 2048 poolOne_99 29 60
#> 92 sample29_A_G12_CATGCAAATTAGAAGACCTTTAGGAA_20to21 0 poolNine_807 29 60
#> 93 sample8_A_O15_CAAGGAACACTAAAATAGGTTAGTAC_11to11 0 poolSeven_655 29 60
#> V6 V7 V8 V9
#> 90 247S247M * 0 0
#> 91 250M244H * 0 0
#> 92 244M * 0 0
#> 93 244M * 0 0
#> V10
#> 90 ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATTCTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGCGCGGCCAAATCCTGAATATCTTTGTTAATTTTCTGTCTCAATGATTTGTATAATATTGACAGTGGGGAGTTAAAGTCTCCCACTATTATTGTGTAGGAGTCGAAGTCTCTTTGTAGGTCTCTAAGAACTTGTTTTATGAATCTGAGTGCTCCTGTATTGGGTACATGTACATTTAGGATAGTTAGCTCTCCTTGTTGAATTGAACCCTTTACCATTACGTAATGCCCTTCTTTGTCTTTTTTAATCTTTG
#> 91 ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATTCTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGCGCGGCC
#> 92 ACAATATTAAGTCTTTCAACCTGTGAACATGGGATGTCTTTCCTTTTATTTGTATCTGCTTTAATTTCCTTCATCAAGGTTTTTTAGTTTCCAGTGTACAAGTCTTACACTTTCTTAAGTTTATTCCTATTTTATTATTTTTAATCCTATTGTAAATGGGATTCTTATGTCCTTTTTGTATAGTTTATTTTTAGTATATAGAAATGTCACTGATTTTTGTATGTTTTGTATGCTGCAACTTAAT
#> 93 AGATAAATTACATTCATGAAAGAAGCATATTATTTTTAAAGTACTTTATTTTGGAAAGGTAAAATGCTTGTGTAGTTATAATTTGGTTACTCTTGATTTCACCTTAGGAAAAACAATATCACCTTCTAACCATTTCTTTTTTAGTCAAATCTCTTGCTTCTATTTCTCTCTGTAGATCCGCTATTAAAGACTGTAATCACTGCTGCATCTTTCCTGTAAGGCTTGATCGCATTGTTAATTTCTT
#> V11 V12 V13 V14 V15 V16 V17 V18 V19
#> 90 * NM:i:1 ms:i:227 AS:i:227 nn:i:0 tp:A:P cm:i:20 s1:i:215 s2:i:0
#> 91 * NM:i:2 ms:i:210 AS:i:210 nn:i:0 tp:A:P cm:i:20 s1:i:226 s2:i:0
#> 92 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:21 s1:i:232 s2:i:0
#> 93 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:20 s1:i:223 s2:i:0
#> V20 V21 V22 V23
#> 90 de:f:0.0040 SA:Z:poolOne_99,29,+,250M244S,60,2; cs:Z::42*ct:204 rl:i:0
#> 91 de:f:0.0080 SA:Z:poolFour_343,26,+,247S247M,60,1; cs:Z::157*ct*tc:91 rl:i:0
#> 92 de:f:0 cs:Z::244 rl:i:0
#> 93 de:f:0 cs:Z::244 rl:i:0
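Rather than hard-coding 23, the widest row can be measured first with count.fields() from base R's utils package. This is a minimal sketch with a hypothetical helper name (max_fields); it turns off quote and comment handling on the assumption that quote and "#" characters in the SAM fields should be treated as ordinary data.

```r
# Largest number of tab-separated fields found on any line of the file.
max_fields <- function(path) {
  n <- utils::count.fields(path, sep = "\t", quote = "", comment.char = "")
  max(n)
}

# Usage (not run here):
# n <- max_fields("weird_subset_woheader.txt")
# read.table(pipe("cat weird_subset_woheader.txt"),
#   header = FALSE, fill = TRUE,
#   col.names = paste0("V", seq_len(n))
# )
```

Note that count.fields() skips blank lines by default, so the result describes only non-empty rows.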
Here is a related issue thread that might also help https://github.com/tidyverse/readr/issues/762.
Thanks for filing this bug report! Unfortunately, because it requires such a specific setup to reproduce, we believe it's unlikely to affect many people, and we don't have the development resources to fix it at this time. It's our policy to close such issues to help stay focused on the biggest problems, but the issue is still indexed by Google, so if other people hit it, they'll be able to find it, and we can consider reopening it if it turns out to be a common problem. Thanks for reporting and I'm sorry we couldn't help more 😞.