%load_ext autoreload
%autoreload 2
%matplotlib inline
from lgm import *
# torch.cuda.get_device_name(torch.cuda.current_device())
path = untar_data(URLs.IMDB)
# path.ls()
We define a subclass of ItemList, called TextList, that will read the texts in the corresponding filenames.
ItemList is basically just a container that has a length and can be indexed / sliced into.
TextList just adds a method from_files that recursively lists files with the .txt extension in a path and reads their contents.
Just in case there are some text log files, we restrict the ones we take to the training, test, and unsupervised folders.
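To make the idea concrete, here is a minimal sketch of such a container. The class names SimpleItemList and SimpleTextList, the get hook, and the demo folder tree are all made up for illustration; the actual lgm classes may be organized differently.

```python
from pathlib import Path
import tempfile


class SimpleItemList:
    """Hypothetical stand-in for ItemList: a container with length and indexing."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def get(self, item):
        # Subclasses override this to turn a stored item into the returned object.
        return item

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return [self.get(item) for item in self.items[idx]]
        return self.get(self.items[idx])


class SimpleTextList(SimpleItemList):
    """Hypothetical stand-in for TextList: stores paths, reads text on access."""

    @classmethod
    def from_files(cls, path, include=None):
        path = Path(path)
        # Restrict the search to the requested top-level folders, if any.
        roots = [path / name for name in include] if include else [path]
        files = sorted(f for root in roots if root.exists()
                       for f in root.rglob("*.txt"))
        return cls(files)

    def get(self, item):
        # The file is only opened when the item is actually requested.
        return Path(item).read_text(encoding="utf8")


def build_demo_corpus():
    """Create a tiny throwaway folder tree so the sketch can run anywhere."""
    root = Path(tempfile.mkdtemp())
    for folder, name, text in [("train", "a.txt", "good film"),
                               ("test", "b.txt", "bad film"),
                               ("logs", "c.txt", "not a review")]:
        (root / folder).mkdir()
        (root / folder / name).write_text(text, encoding="utf8")
    return root


demo_root = build_demo_corpus()
demo_list = SimpleTextList.from_files(demo_root, include=["train", "test"])
print(len(demo_list), demo_list[0])
```

In this sketch the list only stores paths, and the include argument keeps the file in the logs folder out of the dataset.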
textlist = TextList.from_files(path, include=['train', 'test', 'unsup'])
We should expect a total of 100,000 texts.
len(textlist.items)
Note the textlist contains just the paths to the files. The files are read "just in time" when we index/slice into the list.
# textlist
Here are the first 3 texts:
textlist[:3]
We split the data into a train set (90%) and a validation set (10%).
splitdata = SplitData.split_by_func(textlist, partial(random_splitter, proportion_valid=0.1))
splitdata
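A random split of this kind can be sketched in a few lines. The helpers random_split_flag and split_by_flag are hypothetical names, and we assume that each item is sent to validation independently with probability proportion_valid; the lgm implementation may differ.

```python
import random
from functools import partial


def random_split_flag(item, proportion_valid=0.1, rng=random):
    """Return True if the item should go to validation (hypothetical helper)."""
    return rng.random() < proportion_valid


def split_by_flag(items, flag_func):
    """Partition items into (train, valid) according to flag_func."""
    train, valid = [], []
    for item in items:
        (valid if flag_func(item) else train).append(item)
    return train, valid


rng = random.Random(0)  # seeded only to make the sketch reproducible
demo_items = list(range(1000))
train_items, valid_items = split_by_flag(
    demo_items, partial(random_split_flag, proportion_valid=0.1, rng=rng))
print(len(train_items), len(valid_items))
```

Under this assumption the 90% / 10% proportions hold only approximately, since every item is drawn independently.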
We first need to tokenize the dataset, that is, split each sentence into individual tokens. Those tokens are basic words or punctuation marks, with some morphemic analysis: "don't" is split into "do" and "n't", and so on. We will use a processor for this, in conjunction with the spaCy library.
Before tokenization, we clean up and process the texts in various ways:
After tokenization, we post-process the tokens a little more:
Since tokenizing and applying the rules before and after tokenization takes a bit of time, we'll parallelize all this using ProcessPoolExecutor to go faster.
proc_tok = TokenizeProcessor(max_workers=4)
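The overall shape of such a processor (rules before, tokenization, rules after) can be sketched as follows. This is a toy: a regular expression stands in for spaCy, and the two rules and the special tokens xxbos, xxeos and xxup are illustrative assumptions, not necessarily the ones lgm applies.

```python
import re

# Hypothetical special tokens; the names used by the lgm library may differ.
BOS, EOS, UPPER = "xxbos", "xxeos", "xxup"


def pre_rule_strip_br(text):
    """Example pre-tokenization rule: replace HTML line breaks with newlines."""
    return re.sub(r"<\s*br\s*/?>", "\n", text, flags=re.IGNORECASE)


def toy_tokenize(text):
    """Regex stand-in for a real tokenizer such as spaCy's."""
    # Detach the contraction so that "don't" becomes "do" + "n't".
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Tokens are "n't", runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", text)


def post_rule_caps(tokens):
    """Example post-tokenization rule: mark all-caps words, then lowercase them."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += [UPPER, tok.lower()]
        else:
            out.append(tok)
    return out


def process(text):
    """Apply the pre rule, tokenize, apply the post rule, add boundary tokens."""
    tokens = toy_tokenize(pre_rule_strip_br(text))
    return [BOS] + post_rule_caps(tokens) + [EOS]


print(process("I don't like it.<br />It is SO slow!"))
```

Because each text is processed independently of the others, this kind of pipeline is easy to distribute over several worker processes.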
Once we have tokenized our texts, we replace each token with a unique integer, i.e., we numericalize the texts. Again, we do this with a processor.
When we do language modeling, the labels are inferred from the text during training, so there is no need to label the data. The training loop expects labels, however, so we need to add dummy ones.
proc_num = NumericalizeProcessor()
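Numericalization boils down to building a vocabulary and looking tokens up in it. The sketch below is a hypothetical ToyNumericalizer, not the lgm class: we assume an unknown token at id 0 and a padding token at id 1, and order the rest of the vocabulary by frequency.

```python
from collections import Counter


class ToyNumericalizer:
    """Hypothetical stand-in for NumericalizeProcessor."""

    def __init__(self, specials=("xxunk", "xxpad"), min_freq=1):
        # Assumption: id 0 is the unknown token and id 1 is the padding token.
        self.specials = tuple(specials)
        self.min_freq = min_freq
        self.itos, self.stoi = [], {}

    def fit(self, token_lists):
        """Build the vocabulary from tokenized texts, most frequent first."""
        counts = Counter(tok for toks in token_lists for tok in toks)
        vocab = [tok for tok, count in counts.most_common()
                 if count >= self.min_freq and tok not in self.specials]
        self.itos = list(self.specials) + vocab
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        return self

    def process(self, tokens):
        """Map tokens to ids; tokens outside the vocabulary map to id 0."""
        return [self.stoi.get(tok, 0) for tok in tokens]

    def deprocess(self, ids):
        """Map ids back to tokens."""
        return [self.itos[i] for i in ids]


corpus = [["the", "film", "was", "good"], ["the", "film", "was", "bad"]]
num = ToyNumericalizer().fit(corpus)
ids = num.process(["the", "film", "was", "great"])
print(ids, num.deprocess(ids))
```

Note that the round trip is lossy for tokens the vocabulary has never seen: "great" comes back as the unknown token.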
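The label_by_func function from the lgm library takes a splitdata object, i.e., a dataset split into train and validation, labels each item with the function it is given, and applies the processors passed as processor_x (and processor_y, if any) to the inputs (and targets).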
# %time labeled_list = label_by_func(splitdata, lambda x: 0, processor_x = [proc_tok, proc_num])
Once the texts have been processed, they become lists of numbers, but we can still access the underlying raw data in x_obj (or y_obj for the targets, but we don't have any here).
labeled_list.train
labeled_list.train.x_obj(0)
labeled_list.train.x_obj(47)
labeled_list.valid
labeled_list.valid.x_obj(5)
We can also convert numericalized text back by using the deprocess method associated with the numericalization processor proc_num.
print(proc_num.deprocess(labeled_list.train[0][0]))
Compare with the original text (assuming the first text made it into train, not validation):
print(textlist[0])
Since the preprocessing takes time, we save the intermediate result using pickle. Don't use any lambda functions in your processors, or they cannot be pickled.
import pickle
# pickle.dump(labeled_list, open(path/'labeled_list_lm.pkl', 'wb'))
labeled_list = pickle.load(open(path/'labeled_list_lm.pkl', 'rb'))
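The restriction on lambdas comes from how pickle handles functions: it stores a reference to the function's qualified name rather than its code, and a lambda has no name that can be looked up again. A quick check, using a small hypothetical helper can_pickle:

```python
import pickle


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so the standard pickle module rejects it.
print(can_pickle(lambda x: 0))
# A function that can be found by name, such as a builtin, is stored by reference.
print(can_pickle(len))
```

The same applies to any object that holds a lambda as an attribute, which is why a processor configured with one fails to pickle as a whole.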
Converting our labeled_list to a DataBunch, that is, batching it through a dataloader, requires additional work.
This is done with an LM_PreLoader class, which processes texts for language models (hence LM) so that they can be fed into dataloaders (hence PreLoader):
It shuffles the texts (if shuffle=True) and creates a big stream by concatenating all of them.
It then cuts this stream into batch_size smaller streams.
Let's see how this works for the smaller validation set:
dl = DataLoader(LM_PreLoader(labeled_list.valid, shuffle=True), batch_size=64)
Let's check that it all works: x1, y1, x2 and y2 should all be of size batch_size by bptt. The text in each row of x1 should continue in the same row of x2. y1 and y2 should contain the same texts as their x counterparts, shifted by one token, so that each target is the token following its input.
iter_dl = iter(dl)
x1,y1 = next(iter_dl)
x2,y2 = next(iter_dl)
x1.size(),y1.size()
print(proc_num.deprocess(x1[0]))
print(proc_num.deprocess(y1[0]))
print(proc_num.deprocess(x2[0]))
print(proc_num.deprocess(y2[0]))
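The same batching logic can be sketched with plain Python lists. The function lm_batches is a hypothetical simplification, not the LM_PreLoader code: it leaves out shuffling and tensors, and it drops the few tokens that do not fit evenly into batch_size rows.

```python
def lm_batches(texts, batch_size, bptt):
    """Yield (x, y) batches for language modeling from numericalized texts.

    Sketch only: concatenate the texts into one stream, cut the stream into
    batch_size contiguous rows, then slide a window of width bptt over the
    rows. y is x shifted by one token.
    """
    stream = [tok for text in texts for tok in text]
    # Reserve one extra token so that the last input still has a target.
    row_len = (len(stream) - 1) // batch_size
    rows_x = [stream[i * row_len:(i + 1) * row_len]
              for i in range(batch_size)]
    rows_y = [stream[i * row_len + 1:(i + 1) * row_len + 1]
              for i in range(batch_size)]
    for start in range(0, row_len, bptt):
        x = [row[start:start + bptt] for row in rows_x]
        y = [row[start:start + bptt] for row in rows_y]
        yield x, y


# Two fake "texts" whose token ids are consecutive, to make the layout visible.
demo_texts = [list(range(0, 10)), list(range(10, 25))]
demo_batches = list(lm_batches(demo_texts, batch_size=2, bptt=4))
for x, y in demo_batches:
    print(x, y)
```

With consecutive integers as token ids, it is easy to see that each row of one batch picks up where the same row of the previous batch stopped, and that every target is the token right after its input.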
Let's use a convenience function to do this quickly, for both the train and validation datasets.
# def get_lm_dls(train_ds, valid_ds, batch_size, bptt, **kwargs):
#     return (DataLoader(LM_PreLoader(train_ds, batch_size, bptt, shuffle=True),
#                        batch_size=batch_size, **kwargs),
#             DataLoader(LM_PreLoader(valid_ds, batch_size, bptt, shuffle=False),
#                        batch_size=2*batch_size, **kwargs))
# def lm_databunchify(splitdata, batch_size, bptt, **kwargs):
#     return DataBunch(*get_lm_dls(splitdata.train, splitdata.valid, batch_size, bptt, **kwargs))
batch_size = 64
bptt = 70
data = lm_databunchify(labeled_list, batch_size, bptt)
Optional: check out the docstring of the __getitem__ method of LM_PreLoader for a simple example of how batching for language models works.
# LM_PreLoader.__getitem__?
When we tackle classification, gathering the data will be a bit different: first we will label our texts with the folder they come from, and then we will need to apply padding to batch them together. To avoid mixing very long texts with very short ones, we will also use a Sampler to sort our samples by length, with a bit of randomness for the training set.
We label our data with CategoryProcessor, which converts the labels (levels of the dependent categorical variable) to integers.
proc_cat = CategoryProcessor()
textlist = TextList.from_files(path, include=['train', 'test'])
splitdata = SplitData.split_by_func(textlist, partial(grandparent_splitter, valid_name='test'))
labeled_list = label_by_func(splitdata, parent_labeler,
                             processor_x=[proc_tok, proc_num],
                             processor_y=proc_cat)
proc_cat.level_dict
proc_cat.otoi
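Labeling by folder and converting labels to integers can be sketched as follows. The helpers parent_label and grandparent_is_valid and the class ToyCategoryProcessor are hypothetical stand-ins for parent_labeler, grandparent_splitter and CategoryProcessor; in particular, sorting the labels alphabetically before numbering them is an assumption.

```python
from pathlib import Path


def parent_label(path):
    """Label a file with the name of the folder that contains it."""
    return Path(path).parent.name


def grandparent_is_valid(path, valid_name="test"):
    """Send a file to validation if its grandparent folder is valid_name."""
    return Path(path).parent.parent.name == valid_name


class ToyCategoryProcessor:
    """Hypothetical stand-in for CategoryProcessor."""

    def fit(self, labels):
        # Assumption: labels are numbered in alphabetical order.
        self.itos = sorted(set(labels))
        self.otoi = {label: i for i, label in enumerate(self.itos)}
        return self

    def process(self, labels):
        return [self.otoi[label] for label in labels]

    def deprocess(self, ids):
        return [self.itos[i] for i in ids]


demo_paths = ["imdb/train/pos/1.txt", "imdb/train/neg/2.txt", "imdb/test/pos/3.txt"]
demo_labels = [parent_label(p) for p in demo_paths]
demo_valid = [grandparent_is_valid(p) for p in demo_paths]
cat = ToyCategoryProcessor().fit(demo_labels)
print(demo_labels, demo_valid, cat.otoi, cat.process(demo_labels))
```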
labeled_list.train
labeled_list.train.x_obj(0)
labeled_list.train.x_obj(47)
labeled_list.valid
labeled_list.valid.x_obj(5)
# import pickle
# pickle.dump(labeled_list, open(path/'labeled_list_clas.pkl', 'wb'))
labeled_list = pickle.load(open(path/'labeled_list_clas.pkl', 'rb'))
Let's check that the labels seem consistent with the texts.
[(labeled_list.train.x_obj(i), labeled_list.train.y_obj(i)) for i in [10, 732, 12552]]
For the validation set, we will simply sort the samples by length, and we begin with the longest ones for memory reasons (it's better to always have the biggest tensors first).
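For the training set, we want some kind of randomness on top of this. We shuffle the texts, group them into chunks of 50 * batch_size, and sort each chunk by length before cutting it into batches, so that each batch holds texts of roughly similar length while the batches themselves stay randomized.
Padding: we add the padding token (id of 1) at the end of each sequence to make them all the same size when batching them. Note that we need padding at the end to be able to use PyTorch convenience functions that will let us ignore that padding (more on this later in the ULMFiT notebook).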
# def pad_collate(samples, pad_idx=1, pad_first=False):
#     # identify the longest document in the minibatch
#     max_len = max([len(sample[0]) for sample in samples])
#     # create rectangular tensor that can accommodate all documents
#     # in the batch up to that max_len, and fill it with padding.
#     results = torch.zeros(len(samples), max_len).long() + pad_idx
#     # take documents in the minibatch and put them in the tensor
#     # keeping padding either at the beginning or at the end
#     for i, sample in enumerate(samples):
#         if pad_first:
#             results[i, -len(sample[0]):] = LongTensor(sample[0])
#         else:
#             results[i, :len(sample[0]) ] = LongTensor(sample[0])
#     return results, tensor([sample[1] for sample in samples])
batch_size = 64
train_sampler = SortishSampler(labeled_list.train.x,
                               key=lambda t: len(labeled_list.train[int(t)][0]),
                               batch_size=batch_size)
train_dl = DataLoader(labeled_list.train, batch_size=batch_size,
                      sampler=train_sampler, collate_fn=pad_collate)
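A sampler that sorts by length "with a bit of randomness" could work roughly as in the sketch below. The function sortish_indices is hypothetical and written with plain lists; the chunk size of 50 * batch_size and the rule that the batch holding the longest sample goes first are assumptions about how SortishSampler behaves, not a copy of its code.

```python
import random


def sortish_indices(lengths, batch_size, rng=random):
    """Return sample indices in a 'sorted-ish' order (sketch only)."""
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    chunk = 50 * batch_size
    # Sort by decreasing length inside each shuffled chunk.
    sorted_idxs = []
    for i in range(0, len(idxs), chunk):
        sorted_idxs += sorted(idxs[i:i + chunk],
                              key=lambda j: lengths[j], reverse=True)
    batches = [sorted_idxs[i:i + batch_size]
               for i in range(0, len(sorted_idxs), batch_size)]
    # Move the batch that starts with the longest sample to the front.
    longest = max(range(len(batches)), key=lambda b: lengths[batches[b][0]])
    batches[0], batches[longest] = batches[longest], batches[0]
    # Shuffle the middle batches; the first and the last stay in place.
    if len(batches) > 2:
        middle = batches[1:-1]
        rng.shuffle(middle)
        batches[1:-1] = middle
    return [j for batch in batches for j in batch]


rng = random.Random(0)  # seeded only to make the sketch reproducible
demo_lengths = [rng.randint(5, 500) for _ in range(1000)]
order = sortish_indices(demo_lengths, batch_size=4, rng=rng)
print([demo_lengths[j] for j in order[:4]])
print([demo_lengths[j] for j in order[4:8]])
```

Every index appears exactly once, the very first sample is the longest of the dataset, and each batch is sorted by decreasing length, so little padding is needed inside a batch.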
Let's look at one training batch:
iter_dl = iter(train_dl)
x, y = next(iter_dl)
We can see the padding at the end of every review except the first, which is the longest in the batch:
x
x.size()
y
y.size()
Let's look at the lengths of the documents in this batch. We can get each length by subtracting the number of padding tokens in a row from the width of the tensor:
lengths = []
for i in range(x.size(0)):
    lengths.append(x.size(1) - (x[i]==1).sum().item())
print(lengths)
This is the first batch so it has the longest movie review first. The last one is the shortest movie review in the batch.
If we look at the next batch, we see the lengths fall within a much narrower range:
x,y = next(iter_dl)
lengths = []
for i in range(x.size(0)):
    lengths.append(x.size(1) - (x[i]==1).sum().item())
print(lengths)
And we add a convenience function:
# def get_clas_dls(train_ds, valid_ds, batch_size, **kwargs):
#     train_sampler = SortishSampler(train_ds.x,
#                                    key=lambda t: len(train_ds.x[t]),
#                                    batch_size=batch_size)
#     valid_sampler = SortSampler(valid_ds.x,
#                                 key=lambda t: len(valid_ds.x[t]))
#     return (DataLoader(train_ds, batch_size=batch_size, sampler=train_sampler,
#                        collate_fn=pad_collate, **kwargs),
#             DataLoader(valid_ds, batch_size=batch_size*2, sampler=valid_sampler,
#                        collate_fn=pad_collate, **kwargs))
# def clas_databunchify(splitdata, batch_size, **kwargs):
#     return DataBunch(*get_clas_dls(splitdata.train, splitdata.valid, batch_size, **kwargs))
batch_size = 64
bptt = 70
data = clas_databunchify(labeled_list, batch_size)