java - Encoding problems with Stanford NER. Which encoding should I use?


I'm having a hard time finding the right encoding so that Portuguese results appear properly. I used the following command to tag a small sample with my model:

PS > java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile tweets.txt

As an example, one of the phrases is the following:

meus pais na época em que eram casados e minha mãe era viva !. em praia de copacabana - posto 5 .

Note the special characters in the example: "época" and "mãe". They turn out like this in the final result:

meus/o pais/o na/o época/o em/o que/o eram/o casados/o e/o minha/o mãe/o era/o viva/o !/o ./o em/o praia/b-location de/i-location copacabana/i-location -/o posto/b-location 5/i-location ./o

época = ├⌐poca

mãe = m├úe

This is not the result I expected.
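For what it's worth, the two garbled forms above look like UTF-8 bytes being displayed through the old DOS code page 437. That is only a guess about my console, not something I have confirmed. A minimal sketch that reproduces the same garbling in plain Java (the class and method names are made up for illustration, and it assumes the JDK ships the `IBM437` charset, which full JDK builds normally do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8, then decode those same bytes with another
    // charset. This mimics a console whose code page does not match the
    // encoding of the bytes the program writes.
    static String misdecode(String text, String consoleCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName(consoleCharset));
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII, so the demo
        // does not depend on how the .java file itself is encoded.
        String epoca = "\u00e9poca"; // e-acute followed by "poca"
        String mae = "m\u00e3e";     // "m", a-tilde, "e"

        // IBM437 is an assumption about the console code page.
        System.out.println(misdecode(epoca, "IBM437"));
        System.out.println(misdecode(mae, "IBM437"));
    }
}
```

If that guess is right, the tagger may be writing correct UTF-8 and only the console display is wrong, which would explain why changing the input encoding makes no visible difference.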

If this can also be happening when I train the model, it is a serious concern. I tried using the -encoding flag with various options:

  • utf-8
  • iso-8859-15
  • iso-8859-1

I have had no success with utf-8 or iso-8859-1; each time I get the same result as the one shown above. I made sure the file is encoded in UTF-8, and I also tried re-exporting the file as UTF-8, with the same result.
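To rule out the input file itself, a strict decode is one way to check that its bytes really are valid UTF-8. This is only a sketch: the class name `Utf8Check` is hypothetical, and the file path is passed as the first argument (for example `tweets.txt`).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Returns true only if every byte sequence decodes cleanly as UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

A file containing only ASCII also passes this check, so a "valid" answer confirms the accented characters are stored as UTF-8 only if the file actually contains them.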

I don't know if it has any influence, but I'm using PowerShell to run the commands.

What should I do to solve this problem?

Thanks in advance.

