from datasets import load_dataset
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering, Trainer, TrainingArguments
import torch
dataset = load_dataset("squad", split="train[:1%]") # Use only 1% for quick training
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)
def preprocess(data):
    # Question first, context second; truncate only the context if too long
    inputs = tokenizer(data["question"], data["context"],
                       truncation="only_second", padding="max_length", max_length=384)
    start_positions = []
    end_positions = []
    for i, answer in enumerate(data["answers"]):
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0]) - 1
        # Map character offsets to token indices (sequence 1 = the context)
        start_tok = inputs.char_to_token(i, start_char, sequence_index=1)
        end_tok = inputs.char_to_token(i, end_char, sequence_index=1)
        # Fall back to [CLS] (index 0) if the answer was truncated away
        start_positions.append(start_tok if start_tok is not None else 0)
        end_positions.append(end_tok if end_tok is not None else 0)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
dataset = dataset.map(preprocess, batched=True)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "start_positions", "end_positions"])
training_args = TrainingArguments(
    output_dir="./quick_model",
    per_device_train_batch_size=8,
    num_train_epochs=1,  # my HP Victus CPU can only manage one epoch on 1% of SQuAD
    save_steps=500,
    evaluation_strategy="no",
    fp16=torch.cuda.is_available(),
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Training a Question Answering (QA) system using the DistilBERT model.
This code uses the Hugging Face Transformers library to build a DistilBERT model and quickly fine-tune it on the SQuAD (Stanford Question Answering Dataset) dataset.
The process:
✅ The data is loaded
✅ It is tokenized (words are converted into numeric codes)
✅ The model is prepared
✅ Training starts
📂 1. The dataset – SQuAD
What is SQuAD (Stanford Question Answering Dataset)?
It is a dataset built for question answering, where each example contains:
a text passage (context)
a question about it (question)
the span of the passage that answers the question (answer)
We train the model so that it learns to find the answer inside the passage.
```python
dataset = load_dataset("squad", split="train[:1%]")
```
Only 1% of the SQuAD dataset is loaded, because the CPU in my HP Victus laptop can only cope with a small dataset; I don't have the hardware to train on the full dataset.
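For reference, a single SQuAD record has the shape sketched below. This is a hand-written illustration of the field layout, not an actual row from the dataset:

```python
# Hand-written example mimicking the SQuAD record layout (not a real row)
record = {
    "context": "DistilBERT was introduced by Hugging Face as a smaller, faster BERT.",
    "question": "Who introduced DistilBERT?",
    "answers": {
        "text": ["Hugging Face"],
        "answer_start": [29],  # character offset of the answer inside the context
    },
}

# The answer span can be recovered from the context and answer_start
start = record["answers"]["answer_start"][0]
end = start + len(record["answers"]["text"][0])
print(record["context"][start:end])  # -> Hugging Face
```

This `answer_start` character offset is exactly what `preprocess` above converts into token positions for training.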
2. Model and Tokenization
```python
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)
```
This part loads the DistilBERT model, the lightest (distilled) version of BERT.
Tokenization converts words into numbers so that the model can understand them.
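As a rough illustration of the idea, here is a toy word-to-number lookup. The real DistilBERT tokenizer uses a learned WordPiece subword vocabulary, so the IDs below are made up for the example:

```python
# Toy illustration of tokenization: a real tokenizer uses a learned
# subword (WordPiece) vocabulary, but the principle is the same.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "what": 2054, "is": 2003, "squad": 26408}

def toy_tokenize(text):
    # Lowercase, split on whitespace, look each word up in the vocabulary,
    # and wrap the result in the special [CLS] ... [SEP] markers
    return [toy_vocab["[CLS]"]] + [toy_vocab[w] for w in text.lower().split()] + [toy_vocab["[SEP]"]]

print(toy_tokenize("What is SQuAD"))  # -> [101, 2054, 2003, 26408, 102]
```

In the real model, unknown words are split into known subword pieces instead of failing the lookup, which is why a fast tokenizer can also map character positions back to token positions.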