Data Sorcerers  /  PyCon US 2025  /  Pittsburgh PA

IndoLLNet v1.0

A Novel Python CNN Algorithm Helping Increase Literacy of Traditional Culture for Modern Society — recognizing handwritten Javanese, Sundanese & Lampung scripts with 95–98% accuracy.

PyCon US 2025 · CNN · Computer Vision · NLP · OCR · Cultural Preservation · Python · TensorFlow
122+
Contributors
95–98%
Accuracy
3
Scripts Covered
0
App Downloads
marchel@sys:~$ cat ./indollnet/abstract.md
ABSTRACT.MD

The 2021 Javanese Script Congress proposed replacing Latin script with Javanese script in public life across Yogyakarta. The proposal faces a significant obstacle: literacy in Javanese script is not widespread, especially among younger generations.

IndoLLNet introduces a CNN-based framework for handwritten character recognition covering Javanese, Sundanese, and Lampung scripts — handling diacritics (sandhangan), conjuncts (pasangan), numerals, and all six vowel sounds.

marchel@sys:~$ cat ./problem.log --stderr
⚠ Unicode Limitation
The Javanese Unicode block has a limited number of code points, so text encoding alone cannot represent every complex glyph combination. Image-based recognition is therefore necessary.
⚠ Translation Failures
Existing apps misread sandhangan combinations — e.g., confusing "pepet" with "taling", corrupting words like presiden when typed in Javanese script.
⚠ Literacy Gap
Younger generations, local & foreign tourists, and digital platform users lack access to real-time image-to-text tools for traditional Nusantara scripts.
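To make the encoding side of the problem concrete, here is a minimal sketch that uses Python's standard `unicodedata` module to list the code points behind a Javanese-script word and to look up the pepet and taling vowel signs by name. The helper name `describe` and the choice of sample word (taken from the demo output on this page) are illustrative assumptions, not part of IndoLLNet.

```python
import unicodedata

JAVANESE_BLOCK = range(0xA980, 0xA9E0)  # Unicode Javanese block: U+A980..U+A9DF

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Sample word "kula": a base letter, a vowel sign, and another base letter.
word = "ꦏꦸꦭ"
for code_point, name in describe(word):
    print(code_point, name)

# Pepet and taling are separate combining vowel signs in the block,
# so a recognizer has to tell the two handwritten marks apart.
pepet = unicodedata.lookup("JAVANESE VOWEL SIGN PEPET")
taling = unicodedata.lookup("JAVANESE VOWEL SIGN TALING")
print(ord(pepet) in JAVANESE_BLOCK, ord(taling) in JAVANESE_BLOCK, pepet != taling)
```

A handwriting recognizer that outputs text would ultimately have to emit sequences like these, which is why confusing two visually similar marks changes the word.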
marchel@sys:~$ ls ./features/ --verbose
Multi-Script Support
Recognizes Javanese, Sundanese, and Lampung characters including full conjunct (pasangan) and diacritic (sandhangan) combinations.
Word-Level OCR
Goes beyond single character recognition — identifies entire words in context, handling real-world handwriting variation across 122+ contributors.
Vowel Disambiguation
Precisely differentiates all six vowel sounds: a, i, u, e, é, o — a critical distinction that previous tools consistently failed to handle.
marchel@sys:~$ cat ./model/architecture.json
CNN PIPELINE
INPUT IMAGE → PATCH CROP → RESIZE 11×8px → CONV + BIAS → ReLU + POOL → CLASS OUTPUT
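The patch, crop, and resize stages can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's preprocessing code: it treats the image as a binary grid, crops to the bounding box of the ink, uses nearest-neighbour resampling, and reads "11×8" as 11 rows by 8 columns. The function names and the sample scan are hypothetical. The project lists OpenCV among its tools, so the real pipeline may well rely on it instead.

```python
def crop_to_ink(image):
    """Crop a 2D list of 0/1 pixels to the bounding box of nonzero pixels."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows:  # blank image: nothing to crop
        return image
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]

def resize_nearest(image, out_h=11, out_w=8):
    """Nearest-neighbour resize of a 2D list to out_h rows by out_w columns."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

# Hypothetical 6x6 scan with a small stroke near the middle.
scan = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
patch = resize_nearest(crop_to_ink(scan))
print(len(patch), len(patch[0]))
```

Cropping before resizing keeps the character's strokes filling the small 11×8 grid instead of wasting pixels on empty margins.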

LAYER DEPTH VISUALIZATION
Input 11×8 → Conv ×2 → ReLU Map → MaxPool 2×2 → Flatten → FC → Softmax → Output
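To see how small the feature maps become, the sketch below walks the spatial size through the listed layers. Several details are assumptions, since the page does not state them: a 3×3 kernel, no padding, stride 1, "Conv ×2" read as two stacked convolutions, and a single feature map. The helper names `conv_out` and `pool_out` are illustrative.

```python
def conv_out(h, w, k=3):
    """Output size of a valid, stride-1 convolution with a k x k kernel."""
    return h - k + 1, w - k + 1

def pool_out(h, w, size=2):
    """Output size of non-overlapping size x size max pooling."""
    return h // size, w // size

shape = (11, 8)                 # input patch
shape = conv_out(*shape)        # first conv  -> (9, 6)
shape = conv_out(*shape)        # second conv -> (7, 4)
shape = pool_out(*shape)        # max pool    -> (3, 2)
flat_len = shape[0] * shape[1]  # values per feature map after flatten -> 6
print(shape, flat_len)
```

With an input this small, each valid convolution removes a noticeable share of the pixels, so the kernel size and the number of stacked convolutions directly limit how much reaches the fully connected layer.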
CONVOLUTION.PY
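import random

def convolve2d(input_matrix, kernel):
    # Dimensions of input and kernel (both are 2D lists of numbers)
    input_h, input_w = len(input_matrix), len(input_matrix[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])

    # Output size for a valid (no padding), stride-1 convolution
    output_h = input_h - kernel_h + 1
    output_w = input_w - kernel_w + 1
    output_matrix = [[0] * output_w for _ in range(output_h)]

    # Convolution loop: slide the kernel and sum the element-wise products
    for i in range(output_h):
        for j in range(output_w):
            conv_sum = 0
            for ki in range(kernel_h):
                for kj in range(kernel_w):
                    conv_sum += input_matrix[i + ki][j + kj] * kernel[ki][kj]
            output_matrix[i][j] = conv_sum
    return output_matrix

def relu_with_bias(output_matrix, bias=None):
    # ReLU + bias; the bias is drawn from {-1, 1} when not supplied
    if bias is None:
        bias = random.choice([-1, 1])
    return [[max(0, v + bias) for v in row] for row in output_matrix]

def max_pool_2x2(relu_map):
    # Max pooling 2×2, stride 2 (an odd trailing row or column is dropped)
    pooled_h, pooled_w = len(relu_map) // 2, len(relu_map[0]) // 2
    pooled_map = [[0] * pooled_w for _ in range(pooled_h)]
    for i in range(0, pooled_h * 2, 2):
        for j in range(0, pooled_w * 2, 2):
            pooling_region = [relu_map[i + di][j + dj]
                              for di in range(2) for dj in range(2)]
            pooled_map[i // 2][j // 2] = max(pooling_region)
    return pooled_map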
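The pipeline also lists Flatten, FC, and Softmax stages that the snippet above does not cover. The following is a minimal sketch of those three steps in the same plain-Python style. The function names, the 2×2 pooled map, and the weights are made up for illustration and do not come from the trained model.

```python
import math

def flatten(feature_map):
    """Turn a 2D feature map into a flat list, row by row."""
    return [v for row in feature_map for v in row]

def fully_connected(x, weights, biases):
    """One dense layer: weights holds one row of len(x) values per class."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2x2 pooled map and a 3-class layer with made-up weights.
pooled = [[1, 0], [2, 3]]
weights = [[0.5, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.0],
           [0.0, 0.0, 0.0, 0.5]]
biases = [0.0, 0.0, 0.0]
probs = softmax(fully_connected(flatten(pooled), weights, biases))
predicted_class = probs.index(max(probs))
print(predicted_class)
```

The index of the largest probability is the predicted class, and the probability itself can serve as a per-character confidence score.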
marchel@sys:~$ ls ./dataset/ -la --contributors
DATASET.DAT
122+
Human Contributors
Students, academics, and native script writers across Yogyakarta
20
Aksara Jawa (Base)
ha na ca ra ka · da ta sa wa la · pa dha ja ya nya · ma ga ba tha nga
20
Pasangan (Conjuncts)
Dead-consonant conjunct forms for all 20 base characters
10+
Sandhangan (Diacritics)
Vowel markers, syllable-final markers (wignyan, cecak, layar), and the vowel-killer pangkon
10
Angka Jawa (Numerals)
Javanese digit system 0–9
5
Aksara Rekan (Foreign)
kha · dza · fa/va · za · gha — loanword script forms
All contributors signed a Statement of Consent for collaboration and data utilization for research purposes only — non-commercial, GDPR-aligned.
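As a rough illustration of how these categories could turn into classifier labels, the sketch below builds a label list from the counts given above. The naming scheme (`aksara_`, `pasangan_`, and so on) is hypothetical, and sandhangan classes are left out because the page gives only "10+" rather than an exact list.

```python
BASE = ("ha na ca ra ka da ta sa wa la "
        "pa dha ja ya nya ma ga ba tha nga").split()
REKAN = ["kha", "dza", "fa/va", "za", "gha"]

def build_labels():
    """Build class names for the categories with a fixed, stated size."""
    labels = [f"aksara_{s}" for s in BASE]        # 20 base characters
    labels += [f"pasangan_{s}" for s in BASE]     # 20 conjunct forms
    labels += [f"angka_{d}" for d in range(10)]   # digits 0-9
    labels += [f"rekan_{s}" for s in REKAN]       # 5 loanword forms
    return labels

labels = build_labels()
label_to_index = {name: i for i, name in enumerate(labels)}
print(len(labels))  # 55, before any sandhangan classes are added
```

A fixed mapping like `label_to_index` keeps the softmax output positions stable between training and inference.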
AKSARA SHOWCASE — HA NA CA RA KA
ha na ca ra ka
da ta sa wa la
pa dha ja ya nya
ma ga ba tha nga
The full dataset sheet covers all base characters, pasangan, sandhangan, vowels, numerals, and foreign script forms.
marchel@sys:~$ python eval.py --model indollnet_v1 --report
EVAL REPORT — ACCURACY BY CLASS
Script Type   Category                    Accuracy
Javanese      Base Characters (20)        98%
Javanese      Pasangan (Conjuncts)        97%
Javanese      Sandhangan (Diacritics)     96%
Javanese      Vowel Disambiguation        95%
Javanese      Numerals (0–9)              98%
Sundanese     Base Characters             97%
Lampung       Base Characters             96%
All           Word-level Recognition      95%
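A per-category report like this one is typically produced by grouping predictions by category and dividing correct predictions by the total in each group. The sketch below shows that bookkeeping. The function name and the records are made up for illustration and are unrelated to the figures reported above.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, true_label, predicted_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, truth, predicted in records:
        total[category] += 1
        correct[category] += int(truth == predicted)
    return {cat: correct[cat] / total[cat] for cat in total}

# Made-up predictions, only to show the bookkeeping.
records = [
    ("base", "ha", "ha"),
    ("base", "na", "na"),
    ("base", "ca", "ra"),
    ("base", "ka", "ka"),
    ("numerals", "1", "1"),
    ("numerals", "7", "7"),
]
report = accuracy_by_category(records)
print(report)
```

Reporting accuracy per category rather than one overall number makes it visible when a harder group, such as vowel marks, lags behind the base characters.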
LIVE INFERENCE DEMO
INPUT  › historical_doc_1913.jpg
SCRIPT › Javanese (Hanacaraka)
DETECTED CHARACTERS:
ꦏꦸꦭ kula 98%
ꦤꦸꦮꦸꦤ꧀ nuwun 97%
ꦥꦚꦸꦮꦸꦤ꧀ panyuwun 96%
FULL TEXT OUTPUT >
Kula nuwun, panyuwunipun ampilan arta
waragad pangusungipun rêca saking dhusun
Gupala dhatêng sêtsiyun Srowot f 50...
Source: Historical document, June 24, 1913
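The step from detected characters to Latin text can be illustrated with a small transliteration sketch driven by Unicode character names. It is an assumption-laden toy, not the project's decoder: it handles only plain letters, four vowel signs, and pangkon, and it raises an error for anything else. Unicode names follow their own romanization (for example DDA and TTA), so the output will not always match conventional Javanese spelling.

```python
import unicodedata

# Vowel signs that replace a consonant's inherent "a" (partial list).
VOWEL_SIGNS = {"WULU": "i", "SUKU": "u", "PEPET": "ê", "TALING": "é"}

def transliterate(text):
    """Romanize Javanese script made of letters, listed vowel signs, pangkon."""
    syllables = []
    for ch in text:
        parts = unicodedata.name(ch).split()
        if parts[:2] == ["JAVANESE", "LETTER"] and len(parts) == 3:
            syllables.append(parts[2].lower())   # consonant + inherent "a"
        elif parts[1:3] == ["VOWEL", "SIGN"] and parts[-1] in VOWEL_SIGNS and syllables:
            syllables[-1] = syllables[-1][:-1] + VOWEL_SIGNS[parts[-1]]
        elif parts[-1] == "PANGKON" and syllables:
            syllables[-1] = syllables[-1][:-1]   # drop the inherent vowel
        else:
            raise ValueError(f"unsupported character: {unicodedata.name(ch)}")
    return "".join(syllables)

for word in ["ꦏꦸꦭ", "ꦤꦸꦮꦸꦤ꧀", "ꦥꦚꦸꦮꦸꦤ꧀"]:
    print(word, transliterate(word))
```

The three sample words are the ones shown in the demo output. Extending the sketch to conjuncts and two-part vowels would need extra rules that are out of scope here.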
marchel@sys:~$ cat ./context/org.txt
PROJECT INFO
PROJECT: IndoLLNet v1.0
VERSION: 1.0 (Stable)
ORG: Data Sorcerers / Studio Shodwe
PRESENTED: PyCon US 2025, Pittsburgh PA
DATASET: 122+ contributors, Oct 2024
LICENSE: Research / Non-Commercial
FRAMEWORK: Python · TensorFlow · OpenCV
TECHNIQUE: CNN · Patch Crop · Max Pooling · ReLU
FUTURE: Web App · Mobile App · Social Media Integration
DATA SORCERERS — ORG.TXT

Data Sorcerers is an organization that prepares digital talent for the AI world through project-based and open-source initiatives. It connects students with professionals, practitioners, and academics to build a collaborative ecosystem.

 MOTTO: "Sorcery in Data, Magic in AI"
 MAIN PROJECT: IndoLLNet v1.0 (CV Classification, Nusantara Scripts)
 NEXT: Gamelan & Angklung 2.0 — Neural Soundscapes digitization
marchel@sys:~$ cat ./team.json
PROJECT TEAM — DATA SORCERERS COUNCIL
Marchel Shevchenko: Lead · AI Architect
Hanuna Zoelkha: Dataset · Research
Auban Nur Rizqi: ML Engineer
Zamroch Luluk: Data Collection
122+ Contributors: Handwriting Dataset
Studio Shodwe: Design · Presentation