
Text model integration

[Open] Chandu-4444 opened this issue 3 years ago · 7 comments

I can now do the following:

lm = FastText.LanguageModel(true)
classifier = FastText.TextClassifier(lm)
FastText.train_classifier!(classifier) # Would throw an error as I haven't fully enabled the model to work with FastAI's data container.

Chandu-4444 avatar Jul 21 '22 11:07 Chandu-4444

I think you've re-added some files from TextModels.jl that we don't need, could you remove those? 🙂

lorenzoh avatar Jul 21 '22 18:07 lorenzoh

Sure! Will clean up in the next commit.

Chandu-4444 avatar Jul 21 '22 18:07 Chandu-4444

  • Deleted the unnecessary files from TextModels.jl.
  • Added a utility function for padding the batches for sequence data (very naively implemented, as a starting point).
julia> batches = FastText.load_batchseq(data, task)
julia> batches[1][1]
92-element Vector{Vector{Int64}}:
 [25000, 25000, 25000, 25000, 25000, 25000, 25000, 25000]
 [633779, 633779, 633779, 633779, 633779, 633779, 633779, 633779]
 [2731, 34, 315, 354, 2087, 2209, 70, 1307]
 [44047, 435, 633779, 633779, 6589, 633779, 633779, 205]
 ⋮
 [0, 0, 0, 0, 0, 213, 0, 0]
 [0, 0, 0, 0, 0, 25, 0, 0]
 [0, 0, 0, 0, 0, 1778, 0, 0]

julia> batches[1][2]
8-element Vector{Int64}:
 1
 1
 1
 1
 1
 0
 1
 1

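The naive padding utility described above can be sketched roughly like this (a minimal standalone version for illustration; `pad_batch` and the pad index `0` are assumptions, not FastText.jl's actual API):

```julia
# Pad a batch of token-index sequences to the length of the longest
# sequence in the batch, then transpose into timestep-major form
# (one Vector per timestep), the layout shown in the output above.
function pad_batch(seqs::Vector{Vector{Int}}; padidx::Int = 0)
    maxlen = maximum(length, seqs)
    padded = [vcat(s, fill(padidx, maxlen - length(s))) for s in seqs]
    # timestep-major: element t holds the t-th token of every sequence
    return [[p[t] for p in padded] for t in 1:maxlen]
end

batch = pad_batch([[1, 2, 3], [4, 5], [6]])
# 3 timesteps, each a length-3 vector; shorter sequences end in 0s
```

The zero-filled trailing timesteps correspond to the `[0, 0, …]` rows visible at the end of the batch output above.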
Chandu-4444 avatar Jul 21 '22 21:07 Chandu-4444

  • fastai's way of encoding data doesn't include the removal of stop words (and Jeremy recommends keeping them). So, I removed the stop-word removal step.
  • I've added a vocab_size keyword argument to TextClassificationSingle.
  • Added <unk>, <pad> to the vocabulary.
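Reserving the special tokens at the front of the vocabulary could look something like this (a hypothetical sketch; `build_vocab` is an illustrative name, not FastText.jl's actual function):

```julia
# Build a word-to-index vocabulary, reserving the first slots for
# special tokens such as "<unk>" and "<pad>".
function build_vocab(tokens::Vector{String}; specials = ["<unk>", "<pad>"])
    vocab = Dict{String,Int}(tok => i for (i, tok) in enumerate(specials))
    for tok in tokens
        haskey(vocab, tok) || (vocab[tok] = length(vocab) + 1)
    end
    return vocab
end

v = build_vocab(["the", "cat", "the"])
# "<unk>" => 1, "<pad>" => 2, then corpus tokens in order of first appearance
```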

Next:

  • A batch loader for text data that pads each batch only to the length of its longest sentence.
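That per-batch padding scheme could be sketched like this (`batch_with_padding` is a hypothetical name; the real loader would work over FastAI.jl data containers rather than plain vectors):

```julia
# Partition sequences into batches of size `bs`, padding each batch
# only to the length of its own longest sequence (not the global max),
# so batches of short sentences waste less computation on padding.
function batch_with_padding(seqs::Vector{Vector{Int}}, bs::Int; padidx::Int = 0)
    batches = Vector{Vector{Vector{Int}}}()
    for i in 1:bs:length(seqs)
        chunk = seqs[i:min(i + bs - 1, end)]
        maxlen = maximum(length, chunk)
        push!(batches, [vcat(s, fill(padidx, maxlen - length(s))) for s in chunk])
    end
    return batches
end
```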

Chandu-4444 avatar Jul 25 '22 09:07 Chandu-4444

Should the vocab CSV files be checked in? I would've assumed they would be artifacts or DataDeps as well.

ToucheSir avatar Jul 31 '22 20:07 ToucheSir

julia> data, blocks = load(datarecipes()["imdb"])
((mapobs(loadfile, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64})), mapobs(parentname, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64}))), (Paragraph(), Label{String}(["neg", "pos"])))

julia> task = TextClassificationSingle(blocks, data)
SupervisedTask(Paragraph -> Label{String})

julia> model = FastAI.taskmodel(task, FastText.LanguageModel(false, task))
#90 (generic function with 1 method)

julia> batches = FastText.load_batchseq(data, task)
WARNING: both Losses and NNlib export "ctc_loss"; uses of it in module Flux must be qualified
6250-element Vector{Tuple{Vector{Vector{Int64}}, Flux.OneHotArray{UInt32, 2, 1, 2, Vector{UInt32}}}}:
 ([[35, 35, 35, 35], [3, 3, 3, 9], [40, 18025, 15, 14], [224, 10, 3541, 3040], [737, 34, 24, 505], [49, 7, 809, 3], [4, 4, 221, 3836], [1927, 104, 4, 3], [7, 16, 629, 28440], [6, 351, 7, 17]  …  [2, 2, 2, 44], [2, 2, 2, 3], [2, 2, 2, 9839], [2, 2, 2, 17], [2, 2, 2, 1041], [2, 2, 2, 27], [2, 2, 2, 3], [2, 2, 2, 3836], [2, 2, 2, 3], [2, 2, 2, 28440]], [0 0 1 1; 1 1 0 0])

julia> using FluxTraining

julia> td, vd = splitobs(batches, at=0.9)

julia> using Flux

julia> learner = Learner(model, Flux.Losses.logitcrossentropy, callbacks=[Metrics(accuracy)]; data=(td, vd))
Learner()

julia> fit!(learner, 1)
Epoch 1 TrainingPhase():   0%|█                                                                                               |  ETA: 4 days, 3:35:31 

Chandu-4444 avatar Aug 08 '22 09:08 Chandu-4444

The changes have been merged into #258.

Chandu-4444 avatar Aug 31 '22 13:08 Chandu-4444