1. First, divide the image's pixel count (w*h) by 448*448 to estimate the approximate number of slices.
2. Based on the aspect ratio of the original image, determine patch_size (the width and height of each sliced sub-image, kept as close as possible to the pre-training aspect ratio) and the number of slices, num_slice. The original image is then rescaled with linear interpolation so that it divides evenly into patches of the current patch_size (more on this in a later post).
3. Split the image, tag each patch with its position, sort the patches, and add special tokens:
<im_start> + <unk_token> * num + <im_end>
Here, num is the number of tokens that represent one sub-image in the final language model.
<unk_token> is a placeholder that will later be replaced by image features.
<im_start> and <im_end> are separate tokens that mark the start and end of the image, respectively.
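
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the model's actual code: the function names are hypothetical, the grid-selection rule (pick the factorization of the slice count whose aspect ratio is closest to the image's in log space) is an assumed heuristic, and each patch is simplified to a square of 448*448.

```python
import math

BASE = 448  # pre-training resolution from the text (448*448)


def estimate_num_slices(w, h, base=BASE):
    """Step 1: pixel count divided by base*base, rounded up (at least 1)."""
    return max(1, math.ceil(w * h / (base * base)))


def choose_grid(w, h, num_slices):
    """Step 2 (assumed heuristic): choose (cols, rows) with cols*rows == num_slices
    whose aspect ratio is closest to the image's aspect ratio in log space."""
    log_ratio = math.log(w / h)
    candidates = [
        (c, num_slices // c)
        for c in range(1, num_slices + 1)
        if num_slices % c == 0
    ]
    return min(candidates, key=lambda g: abs(log_ratio - math.log(g[0] / g[1])))


def slice_boxes(grid, patch_w=BASE, patch_h=BASE):
    """Step 3: (left, top, right, bottom) boxes for each patch, in row-major order.
    The image is assumed to be already rescaled to (cols*patch_w, rows*patch_h)."""
    cols, rows = grid
    return [
        (c * patch_w, r * patch_h, (c + 1) * patch_w, (r + 1) * patch_h)
        for r in range(rows)
        for c in range(cols)
    ]


def image_placeholder(num, im_start="<im_start>", im_end="<im_end>", unk="<unk_token>"):
    """Build <im_start> + <unk_token> * num + <im_end> for one sub-image."""
    return im_start + unk * num + im_end


if __name__ == "__main__":
    w, h = 1344, 896
    n = estimate_num_slices(w, h)   # 6
    grid = choose_grid(w, h, n)     # (3, 2)
    boxes = slice_boxes(grid)       # 6 boxes, sorted row by row
    print(n, grid, boxes[0], boxes[-1])
    print(image_placeholder(2))     # <im_start><unk_token><unk_token><im_end>
```

For a 1344*896 image, the estimate is exactly 6 slices, and the 3*2 grid matches the 3:2 aspect ratio, so the rescaled image splits into six 448*448 patches with no remainder.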