Is there a way to replace single entries?
This seems obvious... but I can't figure out how to use the API to replace a single value in a table.
I have a huge table and it has some data entry errors. I don't want to manually change the table in Excel b/c the corrections are a bit subjective and I'd like them documented.
I want to make a separate EDN file of corrections
So it'd have something like:
Go to row 261 and change the entry in the "Date" column to "2012-04-03".
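For instance, something like this (purely a hypothetical shape; the keys are made up):
;; corrections.edn (illustrative only)
[{:row 261, :column "Date", :value "2012-04-03",
  :reason "transposed month and day at entry time"}]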
But I'm not seeing anything to access and modify single entries. The API seems to just allow you to do bulk column/row operations. I can row-map, but then I need to run through the whole table and "search" for the correct row to change (but I don't have the row index.. so that's also messy)
I'm a bit confused what my workflow should be here - or maybe this isn't the right tool for the job?
This is a great question.
What have you tried?
This is the first thing that I think of:
user> (def _ (add-lib 'org.scicloj/noj))
#'user/_
user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->>dataset {:x [17 42 999] :y ["1981-03-10" "1999-07-04" "2025-07-09"]}))
#'user/ds
user> ds
_unnamed [3 2]:
| :x | :y |
|----:|------------|
| 17 | 1981-03-10 |
| 42 | 1999-07-04 |
| 999 | 2025-07-09 |
user> (update ds :y assoc 1 "2012-04-03")
_unnamed [3 2]:
| :x | :y |
|----:|------------|
| 17 | 1981-03-10 |
| 42 | 2012-04-03 |
| 999 | 2025-07-09 |
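To tie it back to your corrections file, you could fold the whole file in with reduce. A minimal sketch, assuming each correction is a map shaped like {:row 1, :column :y, :value "2012-04-03"} (that shape is my invention, not anything from the library):
(require '[clojure.edn :as edn])

;; hypothetical helper: apply each correction to the dataset in turn
(defn apply-corrections [dataset corrections]
  (reduce (fn [d {:keys [row column value]}]
            (update d column assoc row value))
          dataset
          corrections))

(apply-corrections ds (edn/read-string (slurp "corrections.edn")))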
Ohhhhh okay, so it was obvious :))
I really appreciate the pointer here. From the docs I understood the dataset could be used as a map of columns, but I didn't realize I could directly assoc into the column seqs like that
Thank you so much for taking the time to help me :)) Really appreciate it @harold
Addendum, @harold :))
I guess the reason this didn't occur to me was.. Does this in effect clobber and re-write the whole column under the hood at each update? Or is there clever magic going on?
I'm just curious about the performance implications (though not very relevant for my use case in all honesty)
(def testset
(ds/->dataset [{:a 1
:b 2}
{:a 2
:c 3}]))
(:b testset)
;; => #tech.v3.dataset.column<int64>[2]
;; :b
;; [2, ]
(def newset
(update testset
:b
assoc
1
"hello"))
(:b newset)
;; => #tech.v3.dataset.column<object>[2]
;; :b
;; [2, hello]
I note how the type of the column "drops" to Object, but if I had to guess, it's extracting a seq and then with assoc making a new mixed-type seq (which type of seq though... I'm not sure)
Thank you so much for taking the time to help me :))
You're so welcome.
More good questions.
My thoughts...
Ohhhhh okay, so it was obvious :))
idk about 'obvious', but hopefully easy.
but I didn't realize I could directly assoc into the column seqs like that
we can assoc the column because it's Associative (https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/Associative.java)
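A quick way to see that at the REPL, continuing the first session above (I've elided the exact printed column output):

(associative? (:y ds))          ;; => true
(assoc (:y ds) 1 "2012-04-03")  ;; => a new column; ds itself is unchanged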
Does this in effect clobber and re-write the whole column under the hood at each update? Or is there clever magic going on?
No magic.
No, it neither clobbers anything, nor necessarily copies everything.
Check it out.
user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->>dataset {:x [1 2]}))
#'user/ds
user> ds
_unnamed [2 1]:
| :x |
|---:|
| 1 |
| 2 |
user> (update ds :x assoc 1 17)
_unnamed [2 1]:
| :x |
|---:|
| 1 |
| 17 |
user> ds
_unnamed [2 1]:
| :x |
|---:|
| 1 |
| 2 |
So, nothing was clobbered. The dataset is a value.
Compare this with pandas...
$ python
Python 3.13.3 (main, Jun 16 2025, 18:15:32) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [1 2]})
File "<python-input-1>", line 1
df = pd.DataFrame({'x': [1 2]})
^^^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
>>> df = pd.DataFrame({'x': [1, 2]})
>>> df
x
0 1
1 2
>>> df['x'][1] = 17
<python-input-4>:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df
x
0 1
1 17
>>> # UH OH SPAGHETTI Os
I'm just curious about the performance implications (though not very relevant for my use case in all honesty)
It's gonna perform great, nothing to see here.
You're correct that your use case will never have performance problems, and if this pathway became a performance problem, that would be great; Clojure's polymorphism constructs make it easy to slip in a purpose-built implementation here that will outperform anything.
I note how the type of the column "drops" to Object
I would say the column was 'promoted' to Object, which it must be since your column now contains numbers and strings.
Amazingly, the type is preserved in at least some cases:
user> (def ds (ds/->>dataset {:x [1 2]}))
#'user/ds
user> ds
_unnamed [2 1]:
| :x |
|---:|
| 1 |
| 2 |
user> (map meta (vals (update ds :x assoc 1 17)))
({:name :x, :datatype :int64, :n-elems 2})
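And going the other way, assoc'ing in a mismatched type should be what flips that metadata (a sketch; based on the column<object> printout above I'd expect :object here, though I haven't pasted a session):

(map meta (vals (update ds :x assoc 1 "hello")))
;; => ({:name :x, :datatype :object, :n-elems 2}) (expected)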
Fun, and thought-provoking stuff. hth
Is a Column really Associative?
user> (def _ (add-lib 'org.scicloj/noj))
#'user/_
user> (def _ (add-lib 'io.github.tonsky/clojure-plus))
#'user/_
user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset {:x [1 2 3]}))
#'user/ds
user> ds
_unnamed [3 1]:
| :x |
|---:|
| 1 |
| 2 |
| 3 |
user> (require 'clojure+.core)
nil
user> (clojure+.core/print-class-tree (class ds))
Dataset
├╴Object
├╴Counted
├╴IFn
│ ├╴Runnable
│ └╴Callable
├╴IHashEq
├╴ILookup
├╴IObj
│ └╴IMeta
├╴IPersistentMap
│ ├╴Associative
│ │ ├╴ILookup
│ │ └╴IPersistentCollection
│ │ └╴Seqable
│ ├╴Counted
│ └╴Iterable
├╴IType
├╴MapEquivalence
├╴Iterable
├╴Map
├╴PColumnCount
├╴PDataset
├╴PMissing
├╴PRowCount
├╴PSelectColumns
├╴PSelectRows
├╴PClone
├╴PCopyRawData
└╴PShape
nil
user> (clojure+.core/print-class-tree (class (:x ds)))
Column
├╴Object
├╴IType
├╴IMutList
│ ├╴Associative
│ │ ├╴ILookup
│ │ └╴IPersistentCollection
│ │ └╴Seqable
│ ├╴IHashEq
│ ├╴IKVReduce
│ ├╴IObj
│ │ └╴IMeta
│ ├╴IReduce
│ │ └╴IReduceInit
│ ├╴Indexed
│ │ └╴Counted
│ ├╴Reversible
│ ├╴Seqable
│ ├╴Sequential
│ ├╴IFnDef
│ │ └╴IFn
│ │ ├╴Runnable
│ │ └╴Callable
│ ├╴ITypedReduce
│ │ ├╴IReduce
│ │ │ └╴IReduceInit
│ │ └╴IReduceInit
│ ├╴ImmutSort
│ │ └╴List
│ │ └╴SequencedCollection
│ │ └╴Collection
│ │ └╴Iterable
│ ├╴RangeList
│ ├╴Cloneable
│ ├╴Comparable
│ ├╴List
│ │ └╴SequencedCollection
│ │ └╴Collection
│ │ └╴Iterable
│ └╴RandomAccess
├╴PColumn
├╴PColumnName
├╴PMissing
├╴PRowCount
├╴PSelectRows
├╴PClone
├╴PECount
├╴PElemwiseCast
├╴PElemwiseDatatype
├╴PElemwiseReaderCast
├╴POperationalElemwiseDatatype
├╴PSubBuffer
├╴PToArrayBuffer
├╴PToBuffer
├╴PToNativeBuffer
├╴PToReader
└╴PToWriter
(cc: @tonsky - wow!)
- https://github.com/tonsky/clojure-plus/blob/708b9fa1155d0599c5d1e0a891d085aa93b7acb9/src/clojure%2B/core.clj#L163
(sorry for the slow response! Really appreciate your help grokking this)
Ah okay, so the columns are not compact array-like structures? Sorry, I think a mix-up of terminology on my end caused the confusion. I guess when I said "clobbers" I meant forcing a complete re-write :)
I'm imagining the columns are compact b/c they need to be iterated over quickly (for both memory compactness and cache locality)
If you have a compact array-like of int64s... I could see how you could insert another int64. But inserting a string promotes the column to Object and would require a full re-write of the whole column. Hence inserting a string would be much slower
I guess if under the hood it's something associative that's more like a vector then this isn't an issue. You do make a copy, but it refers back to the old one, so the performance is great. You do lose cache locality and compactness though
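Like how plain Clojure vectors behave, e.g.:

(def v [1 2 3])
(def v2 (assoc v 1 17))
v  ;; => [1 2 3]  (unchanged; still a value)
v2 ;; => [1 17 3] (shares the untouched parts of v's structure)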
I'll have to look into the data types more again. I remember a lot of the details are explained very clearly in some video talks. It's just a bit harder to find that knowledge there than in a good blog post haha :)
Ah okay, so the columns are not compact array-like structures?
Of course they're compact array-like structures! How else could it be fast? 😄
But inserting a string promotes the column to Object and would require a full re-write of the whole column.
Copying the whole column is one possible implementation. But certainly not the only one.
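As a toy illustration (emphatically not what tech.v3.dataset actually does; just one conceivable strategy), reads could consult a small map of edits first and fall through to the untouched base array, so an update copies nothing:

;; toy sketch only, not the library's implementation
(defn overlay [base edits]
  (fn [idx] (get edits idx (aget ^longs base idx))))

(def read-x (overlay (long-array [1 2]) {1 17}))
(read-x 0) ;; => 1
(read-x 1) ;; => 17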
I guess if under the hood it's something associative that's more like a vector then this isn't an issue.
Also correct.
The good news here is that Clojure's polymorphism constructs allow us to use whichever will perform best.
The thing that's really making me smile here is that in your particular case, you're getting both, for free (!), with zero effort.
When you load your "huge table", the data is stored in memory in 'compact array-like' structures, which support efficient computation. And (!), as we've seen, those same structures can be treated as associative, making it really easy to incorporate your 'separate EDN file of corrections'. And and (!!) all the while the dataset remains a value -- in the strong sense that conveying it really conveys information, unlike 'uh oh spaghetti-o time' in pandas.
Your lovely question led us straight to the heart of functional data science. 🙇
Hmm, that kinda leaves me with more questions than answers haha
To me you can have either compact arrays or vectors (which are non-compact). So I'm curious how you're able to achieve both at the same time...? I'll try to dig deeper into the documentation and piece it together
Thank you for your help, I really appreciate it