Tokenization, corpus, encoder/decoder