IndoLLNet v1.0 — Data Sorcerers

marchel@sys:~$ cat ./indollnet/abstract.md

ABSTRACT.MD

The 2021 Javanese Script Congress proposed replacing Latin script with Javanese script in public life across Yogyakarta. However, the proposal faces significant challenges because literacy in Javanese script is not widespread, especially among younger generations.

IndoLLNet introduces a CNN-based framework for handwritten character recognition covering Javanese, Sundanese, and Lampung scripts — handling diacritics (sandhangan), conjuncts (pasangan), numerals, and all six vowel sounds.

marchel@sys:~$ cat ./problem.log --stderr

⚠ Unicode Limitation

The Javanese Unicode block has a limited number of code points, so not every complex glyph combination can be represented through text encoding alone. Image-based recognition is therefore necessary.
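To make the encoding side concrete, the sketch below inspects a word as a sequence of code points inside the Javanese Unicode block (U+A980 to U+A9DF, 96 slots in total). Stacked forms such as pasangan have no slots of their own; the font composes them at render time. The helper name `in_javanese_block` and the sample word are illustrative choices, not part of IndoLLNet.

```python
import unicodedata

# The Javanese block spans U+A980..U+A9DF inclusive.
JAVANESE_BLOCK = range(0xA980, 0xA9E0)

def in_javanese_block(ch: str) -> bool:
    """True if the character's code point lies inside the Javanese block."""
    return ord(ch) in JAVANESE_BLOCK

# "kula": letter KA + vowel sign SUKU + letter LA
word = "\ua98f\ua9b8\ua9ad"

for ch in word:
    # unicodedata.name falls back to the default if the code point is unnamed
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), in_javanese_block(ch))

block_size = len(JAVANESE_BLOCK)
print("slots in block:", block_size)
```

The word is three code points long even though it renders as two visual clusters, which is why a recogniser working on images has to reason about glyph shapes rather than stored characters.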

⚠ Translation Failures

Existing apps misread sandhangan combinations — e.g., confusing "pepet" with "taling", corrupting words like presiden when typed in Javanese script.

⚠ Literacy Gap

Younger generations, local & foreign tourists, and digital platform users lack access to real-time image-to-text tools for traditional Nusantara scripts.

marchel@sys:~$ ls ./features/ --verbose

ꦲ

Multi-Script Support

Recognizes Javanese, Sundanese, and Lampung characters including full conjunct (pasangan) and diacritic (sandhangan) combinations.

◈

Word-Level OCR

Goes beyond single character recognition — identifies entire words in context, handling real-world handwriting variation across 122+ contributors.

◎

Vowel Disambiguation

Precisely differentiates all six vowel sounds: a, i, u, e, é, o — a critical distinction that previous tools consistently failed to handle.
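One way to picture the disambiguation step is a lookup from detected sandhangan marks to a vowel. The sketch below assumes the conventional pairing (no mark for a, wulu for i, suku for u, pepet for e, taling for é, taling plus tarung for o). The names `VOWEL_MARKS` and `vowel_from_marks` are hypothetical and not taken from IndoLLNet.

```python
# Conventional vowel-to-sandhangan pairing (assumed, see lead-in).
VOWEL_MARKS = {
    "a": (),                     # inherent vowel, no mark
    "i": ("wulu",),
    "u": ("suku",),
    "e": ("pepet",),             # schwa, often romanised as ê
    "é": ("taling",),
    "o": ("taling", "tarung"),   # taling before the letter, tarung after it
}

def vowel_from_marks(marks):
    """Return the vowel whose sandhangan set matches the detected marks."""
    detected = tuple(sorted(marks))
    for vowel, expected in VOWEL_MARKS.items():
        if tuple(sorted(expected)) == detected:
            return vowel
    raise ValueError(f"unrecognised sandhangan combination: {marks!r}")

print(vowel_from_marks(["pepet"]))             # e
print(vowel_from_marks(["taling"]))            # é
print(vowel_from_marks(["tarung", "taling"]))  # o
```

Under this pairing, the pepet/taling confusion described earlier amounts to returning é where e was written, or the reverse.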

marchel@sys:~$ cat ./model/architecture.json

CNN PIPELINE

INPUT IMAGE ▶ PATCH CROP ▶ RESIZE 11×8px ▶ CONV +BIAS ▶ ReLU +POOL ▶ CLASS OUTPUT
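The first three stages (input image, patch crop, resize) can be sketched in pure Python as follows. This is an illustration under stated assumptions: 11×8 is read as 11 rows by 8 columns, nearest-neighbour sampling is used for the resize, and the function names are hypothetical. The project lists OpenCV, so the actual pipeline presumably uses its resize routine instead.

```python
def crop_patch(image, top, left, height, width):
    """Cut a rectangular patch out of a 2D list of pixel values."""
    return [row[left:left + width] for row in image[top:top + height]]

def resize_nearest(patch, out_h=11, out_w=8):
    """Nearest-neighbour resize to out_h rows by out_w columns."""
    in_h, in_w = len(patch), len(patch[0])
    return [[patch[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

# Toy 22×16 "image" whose pixel value encodes its own position.
image = [[r * 16 + c for c in range(16)] for r in range(22)]

patch = crop_patch(image, top=0, left=0, height=22, width=16)
small = resize_nearest(patch)

print(len(small), "rows x", len(small[0]), "columns")
```

Because the toy patch is exactly twice the target size in each direction, every output pixel samples every second input pixel.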

LAYER DEPTH VISUALIZATION

Input 11×8 ▶ Conv ×2 ▶ ReLU Map ▶ MaxPool 2×2 ▶ Flatten FC ▶ Softmax Output
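The last stages of the diagram (Flatten, FC, Softmax) can be sketched in pure Python as below. The feature map, weights, and biases are toy values chosen for illustration and have nothing to do with the trained model. The function names are likewise hypothetical.

```python
import math

def flatten(feature_map):
    """Turn a 2D feature map into a flat list, row by row."""
    return [v for row in feature_map for v in row]

def fully_connected(inputs, weights, biases):
    """One dense layer: weights[k] holds one weight per input for class k."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Convert logits to probabilities; subtract the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

pooled_map = [[1.0, 0.0], [0.5, 2.0]]    # toy pooled feature map
weights = [[0.1, 0.2, 0.3, 0.4],         # toy weights for class 0
           [0.4, 0.3, 0.2, 0.1]]         # toy weights for class 1
biases = [0.0, 0.0]

probs = softmax(fully_connected(flatten(pooled_map), weights, biases))
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is the index of the largest probability. In a real classifier that index would map to a character label such as ha or na.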

CONVOLUTION.PY

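import random

# input_matrix (e.g. an 11×8 patch) and kernel are 2D lists of numbers,
# assumed to be defined earlier in the pipeline.

# Dimensions of input and kernel
input_h, input_w = len(input_matrix), len(input_matrix[0])
kernel_h, kernel_w = len(kernel), len(kernel[0])

# Output size (no padding, stride 1)
output_h = input_h - kernel_h + 1
output_w = input_w - kernel_w + 1
output_matrix = [[0] * output_w for _ in range(output_h)]

# Convolution loop (the kernel is applied without flipping)
for i in range(output_h):
    for j in range(output_w):
        conv_sum = 0
        for ki in range(kernel_h):
            for kj in range(kernel_w):
                conv_sum += input_matrix[i + ki][j + kj] \
                            * kernel[ki][kj]
        output_matrix[i][j] = conv_sum

# ReLU + bias (the bias is drawn at random from -1 and 1)
bias = random.choice([-1, 1])
relu_map = [[max(0, v + bias) for v in row]
            for row in output_matrix]

# Max pooling 2×2, stride 2 (a trailing odd row or column is dropped)
pooled_h, pooled_w = len(relu_map) // 2, len(relu_map[0]) // 2
pooled_map = [[0] * pooled_w for _ in range(pooled_h)]
for i in range(0, pooled_h * 2, 2):
    for j in range(0, pooled_w * 2, 2):
        pooling_region = [relu_map[i + di][j + dj]
                          for di in (0, 1) for dj in (0, 1)]
        pooled_map[i // 2][j // 2] = max(pooling_region)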
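As a worked example of the size arithmetic in the listing above, the sketch below traces an 11×8 input through one convolution and one 2×2, stride-2 pooling step. The 3×3 kernel size is an assumption made for illustration, since the page does not state the kernel dimensions. The helper names are hypothetical.

```python
def conv_output_shape(in_h, in_w, k_h, k_w):
    """Output size of a convolution with no padding and stride 1."""
    return in_h - k_h + 1, in_w - k_w + 1

def pool_output_shape(in_h, in_w, size=2, stride=2):
    """Output size of pooling; a trailing partial window is dropped."""
    return (in_h - size) // stride + 1, (in_w - size) // stride + 1

conv_shape = conv_output_shape(11, 8, 3, 3)   # assumed 3×3 kernel
pool_shape = pool_output_shape(*conv_shape)

print("after conv:", conv_shape)   # (9, 6)
print("after pool:", pool_shape)   # (4, 3)
```

With these assumptions the 88 input pixels shrink to a 4×3 map of 12 values per kernel, which is what the flatten step would hand to the classifier.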

marchel@sys:~$ ls ./dataset/ -la --contributors

DATASET.DAT

122+ Human Contributors

Students, academics, and native script writers across Yogyakarta

Aksara Jawa (Base)

ha na ca ra ka · da ta sa wa la · pa dha ja ya nya · ma ga ba tha nga

Pasangan (Conjuncts)

Dead-consonant conjunct forms for all 20 base characters

10+ Sandhangan (Diacritics)

Vowel markers and syllable-final markers (wignyan, cecak, layar), plus the vowel-killing pangkon

Angka Jawa (Numerals)

Javanese digit system 0–9

Aksara Rekan (Foreign)

kha · dza · fa/va · za · gha — loanword script forms

All contributors signed a Statement of Consent for collaboration and data utilization for research purposes only — non-commercial, GDPR-aligned.

AKSARA SHOWCASE — HA NA CA RA KA

ꦲ ha · ꦤ na · ꦕ ca · ꦫ ra · ꦏ ka
ꦢ da · ꦠ ta · ꦱ sa · ꦮ wa · ꦭ la
ꦥ pa · ꦝ dha · ꦗ ja · ꦪ ya · ꦚ nya
ꦩ ma · ꦒ ga · ꦧ ba · ꦛ tha · ꦔ nga

◆ The full dataset sheet covers all base characters, pasangan, sandhangan, vowels, numerals, and foreign script forms.

marchel@sys:~$ python eval.py --model indollnet_v1 --report

EVAL REPORT — ACCURACY BY CLASS

Script Type	Category	Accuracy
Javanese	Base Characters (20)	98%
Javanese	Pasangan (Conjuncts)	97%
Javanese	Sandhangan (Diacritics)	96%
Javanese	Vowel Disambiguation	95%
Javanese	Numerals (0–9)	98%
Sundanese	Base Characters	97%
Lampung	Base Characters	96%
All	Word-level Recognition	95%
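A per-class accuracy report of this kind can be computed as sketched below. The labels and predictions are made-up toy data used only to show the calculation; they are unrelated to the figures in the table. The function name `accuracy_by_class` is hypothetical, not the interface of `eval.py`.

```python
from collections import defaultdict

def accuracy_by_class(y_true, y_pred):
    """Fraction of correct predictions for each true label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy data: true labels and model predictions for eight samples.
y_true = ["ha", "ha", "na", "na", "ca", "ca", "ca", "ca"]
y_pred = ["ha", "na", "na", "na", "ca", "ca", "ca", "ha"]

report = accuracy_by_class(y_true, y_pred)
for label, acc in report.items():
    print(f"{label}\t{acc:.0%}")
```

Grouping by category (base characters, pasangan, sandhangan) instead of by character only requires passing the category name as the label.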

LIVE INFERENCE DEMO

INPUT › historical_doc_1913.jpg

SCRIPT › Javanese (Hanacaraka)

DETECTED CHARACTERS:

ꦏꦸꦭ kula (98%)
ꦤꦸꦮꦸꦤ꧀ nuwun (97%)
ꦥꦚꦸꦮꦸꦤ꧀ panyuwun (96%)

FULL TEXT OUTPUT >
Kula nuwun, panyuwunipun ampilan arta
waragad pangusungipun rêca saking dhusun
Gupala dhatêng sêtsiyun Srowot f 50...

Source: Historical document, June 24, 1913
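The demo pairs each detected glyph sequence with a Latin reading. A minimal sketch of that final mapping step is given below, covering only the code points needed for the three detected words. The table names and the `transliterate` function are hypothetical; a real system would need the full character set, pasangan handling, and the remaining sandhangan.

```python
# Base letters carry an inherent "a" (only the letters used in the demo).
CONSONANTS = {
    "\ua98f": "k",   # KA
    "\ua9ad": "l",   # LA
    "\ua9a4": "n",   # NA
    "\ua9ae": "w",   # WA
    "\ua9a5": "p",   # PA
    "\ua99a": "ny",  # NYA
}
VOWEL_SIGNS = {"\ua9b8": "u"}  # SUKU replaces the inherent vowel with "u"
PANGKON = "\ua9c0"             # removes the inherent vowel

def transliterate(text):
    """Map a Javanese code point sequence to a Latin reading."""
    syllables = []  # each entry is [consonant, vowel]
    for ch in text:
        if ch in CONSONANTS:
            syllables.append([CONSONANTS[ch], "a"])
        elif ch in VOWEL_SIGNS and syllables:
            syllables[-1][1] = VOWEL_SIGNS[ch]
        elif ch == PANGKON and syllables:
            syllables[-1][1] = ""
        else:
            raise ValueError(f"unsupported code point U+{ord(ch):04X}")
    return "".join(c + v for c, v in syllables)

print(transliterate("\ua98f\ua9b8\ua9ad"))              # kula
print(transliterate("\ua9a4\ua9b8\ua9ae\ua9b8\ua9a4\ua9c0"))  # nuwun
```

Each base letter opens a new syllable, and the marks that follow it only modify that syllable, so the mapping can be done in a single left-to-right pass.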

marchel@sys:~$ cat ./context/org.txt

PROJECT INFO

PROJECT: IndoLLNet v1.0
VERSION: 1.0 (Stable)
ORG: Data Sorcerers / Studio Shodwe
PRESENTED: PyCon US 2025, Pittsburgh PA
DATASET: 122+ contributors, Oct 2024
LICENSE: Research / Non-Commercial
FRAMEWORK: Python · TensorFlow · OpenCV
TECHNIQUE: CNN · Patch Crop · Max Pooling · ReLU
FUTURE: Web App · Mobile App · Social Media Integration

DATA SORCERERS — ORG.TXT

Data Sorcerers is an organization that prepares digital talent for the AI world through project-based and open source initiatives. It bridges students with professionals, practitioners, and academics to build a collaborative ecosystem.

◆ MOTTO: "Sorcery in Data, Magic in AI"

◆ MAIN PROJECT: IndoLLNet v1.0 (CV Classification, Nusantara Scripts)

◆ NEXT: Gamelan & Angklung 2.0 — Neural Soundscapes digitization

marchel@sys:~$ cat ./team.json

PROJECT TEAM — DATA SORCERERS COUNCIL

Marchel Shevchenko: Lead · AI Architect
Hanuna Zoelkha: Dataset · Research
Auban Nur Rizqi: ML Engineer
Zamroch Luluk: Data Collection
122+ Contributors: Handwriting Dataset
Studio Shodwe: Design · Presentation